Kseniya Parkhamchuk
Back to "Shape of everything inside a transformer"

Tokens and embeddings

Feel free to check the input transformation visualization to refresh your memory of the flow.

I also like to come back to this resource, as it may help you visualize the whole architecture.

Below is a table of all the values whose dimensionality you might have doubts about. It is preceded by a short vocabulary of definitions to prevent misunderstandings.

SEQ_LEN - maximum length of the input sequence (longer sequences are truncated, shorter ones are padded with the <PAD> token). Fixed during training.

BATCH_SIZE - number of sequences of length SEQ_LEN that go through one training step at once.

VOCAB_SIZE - defines how many tokens the model can identify. Think of it as the number of words a dictionary contains and that you can look up (modern models usually have vocabularies of ~30,000-100,000 tokens).

EMB_DIM - embedding dimensionality; a hyperparameter chosen by the model designer to suit the model's size and depth.
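
To make these four values concrete, here is a minimal sketch of how they might be grouped into a single configuration object. The class name ModelConfig and the concrete numbers are illustrative assumptions, not values taken from this article.

from dataclasses import dataclass

@dataclass
class ModelConfig:
    seq_len: int = 128        # SEQ_LEN: fixed maximum sequence length during training
    batch_size: int = 32      # BATCH_SIZE: sequences processed in one training step
    vocab_size: int = 50_000  # VOCAB_SIZE: number of distinct tokens the model knows
    emb_dim: int = 256        # EMB_DIM: length of every token / position vector

config = ModelConfig()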

Below is the table of dimensions; it applies to training mode. Each entry lists the value, its dimensionality, the reasoning behind that shape, and an example. A runnable sketch after the table ties all the shapes together.

Value: Data chunks from the dataloader
Dimensionality: [BATCH_SIZE, SEQ_LEN]
Reasoning: Reducing the batch size results in less memory usage. Sequence length is a hyperparameter that defines how many tokens the model can "remember".
Example:
batch = [
    [1, 2, 3, 4, 5],
    [11, 36, 47, 53, 15],
    [23, 123, 5, 2, 55]
]
BATCH_SIZE = 3
SEQ_LEN = 5
Shape: [3, 5]

Value: Inputs (the token sequence fed to the model during training)
Dimensionality: [BATCH_SIZE, SEQ_LEN - 1]
Reasoning: We shift the sequence by one token to create input-target pairs for next-token prediction.
Example:
inputs = batch[:, :-1]
inputs = [
    [1, 2, 3, 4],
    [11, 36, 47, 53],
    [23, 123, 5, 2]
]
BATCH_SIZE = 3
SEQ_LEN - 1 = 4
Shape: [3, 4]

Value: Targets (the tokens the model should predict)
Dimensionality: [BATCH_SIZE, SEQ_LEN - 1]
Reasoning: The shape is the same as for the inputs. Targets are the same sequence as the inputs, shifted one step ahead so that each position holds the next token the model has to predict.
Example:
targets = batch[:, 1:]
targets = [
    [2, 3, 4, 5],
    [36, 47, 53, 15],
    [123, 5, 2, 55]
]
Shape: [3, 4]

Value: Token embeddings
Dimensionality: [VOCAB_SIZE, EMB_DIM]
Reasoning: Stores one vector of EMB_DIM dimensionality for each token in the vocabulary. Embeddings are learnable parameters.
Example: token embeddings (illustration)

Value: Position embeddings
Dimensionality: [SEQ_LEN, EMB_DIM]
Reasoning: Stores information about each token's position in the sequence. Usually not learnable parameters; calculated with the sinusoidal formula.
Example: position embeddings (illustration)

Value: Input embeddings (the output of the embedding layer)
Dimensionality: [BATCH_SIZE, SEQ_LEN, EMB_DIM]
Reasoning: Adds the embedding dimensionality to the existing [BATCH_SIZE, SEQ_LEN] inputs, turning each token ID into a vector.
Example: input embeddings (illustration)
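
Here is the runnable sketch mentioned above, in PyTorch, tying all of these shapes together. The toy hyperparameter values (VOCAB_SIZE = 200, EMB_DIM = 8) and the tensor contents are illustrative assumptions; only the shapes matter. Note that after the input-target shift the sequence dimension becomes SEQ_LEN - 1, so the input embeddings here come out as [BATCH_SIZE, SEQ_LEN - 1, EMB_DIM].

import math
import torch

BATCH_SIZE, SEQ_LEN, VOCAB_SIZE, EMB_DIM = 3, 5, 200, 8

# Data chunk from the dataloader: [BATCH_SIZE, SEQ_LEN]
batch = torch.tensor([
    [1, 2, 3, 4, 5],
    [11, 36, 47, 53, 15],
    [23, 123, 5, 2, 55],
])

# Shift by one token to build input-target pairs: both [BATCH_SIZE, SEQ_LEN - 1]
inputs = batch[:, :-1]
targets = batch[:, 1:]
assert inputs.shape == targets.shape == (BATCH_SIZE, SEQ_LEN - 1)

# Token embedding table: [VOCAB_SIZE, EMB_DIM], learnable parameters
token_emb = torch.nn.Embedding(VOCAB_SIZE, EMB_DIM)

# Sinusoidal position embeddings: [SEQ_LEN, EMB_DIM], not learnable
position = torch.arange(SEQ_LEN).unsqueeze(1)  # [SEQ_LEN, 1]
div_term = torch.exp(torch.arange(0, EMB_DIM, 2) * (-math.log(10000.0) / EMB_DIM))
pos_emb = torch.zeros(SEQ_LEN, EMB_DIM)
pos_emb[:, 0::2] = torch.sin(position * div_term)
pos_emb[:, 1::2] = torch.cos(position * div_term)

# Input embeddings: token vectors + position vectors
input_emb = token_emb(inputs) + pos_emb[: inputs.shape[1]]
assert input_emb.shape == (BATCH_SIZE, SEQ_LEN - 1, EMB_DIM)
print(input_emb.shape)  # torch.Size([3, 4, 8])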

That's all regarding token sequences and embeddings.