Feel free to check the input transformation visualization to remind you of the flow.
I also like to come back to this resource as it may help to visualize the whole architecture.
Below is a table of all the values whose dimensionality you might have doubts about, preceded by a short vocabulary of definitions to prevent misunderstandings.
SEQ_LEN - maximum length of the input sequence (longer sequences are truncated, shorter ones are filled with the <PAD> token). It stays fixed during training.
BATCH_SIZE - number of sequences of length SEQ_LEN processed together in one training step.
VOCAB_SIZE - how many distinct tokens the model can recognize. Think of it as the number of words a dictionary contains and that one can look up (modern models usually have vocabularies of ~30,000 - 100,000 tokens).
EMB_DIM - dimensionality of the embedding vectors; a hyperparameter chosen by the architect to best suit the model's size and depth.
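To make the SEQ_LEN definition concrete, here is a minimal, framework-free sketch of padding and truncation. The names `PAD_ID` and `pad_or_truncate`, and the value `SEQ_LEN = 128`, are illustrative assumptions; a real tokenizer exposes its own padding token id.

```python
# A minimal sketch of forcing a tokenized sequence to a fixed SEQ_LEN.
# PAD_ID is a hypothetical id for the <PAD> token.
SEQ_LEN = 128
PAD_ID = 0

def pad_or_truncate(token_ids, seq_len=SEQ_LEN, pad_id=PAD_ID):
    """Truncate sequences longer than seq_len, right-pad shorter ones with pad_id."""
    if len(token_ids) >= seq_len:
        return token_ids[:seq_len]
    return token_ids + [pad_id] * (seq_len - len(token_ids))

assert len(pad_or_truncate(list(range(200)))) == SEQ_LEN  # longer sequence gets truncated
assert len(pad_or_truncate([5, 7, 11])) == SEQ_LEN        # shorter sequence gets padded
```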
Below is the table with the dimensions. It applies to training mode.
Value | Dimensionality | Reasoning
---|---|---
Data chunks from the dataloader | [BATCH_SIZE, SEQ_LEN] | Reducing the batch size lowers memory usage. Sequence length is a hyperparameter that defines how many tokens the model can "remember"
Inputs (token sequences during training) | [BATCH_SIZE, SEQ_LEN - 1] | The sequence is shifted by 1 to create input-target pairs for next-token prediction
Targets (expected next tokens) | [BATCH_SIZE, SEQ_LEN - 1] | Same shape as the inputs. Targets are the same sequence as the inputs, shifted one step ahead to represent what the model should predict
Token embeddings | [VOCAB_SIZE, EMB_DIM] | A table of vectors of dimensionality EMB_DIM, one per token in the vocabulary. These embeddings are learnable parameters
Position embeddings | [SEQ_LEN, EMB_DIM] | Stores information about each token's position in the sequence. Usually not learnable; computed with the sinusoidal formula
Input embeddings (the output of the embedding layer) | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | Adds the embedding dimension to the existing batch_size x seq_len of inputs, turning tokens into vectors (with the shifted inputs above, the middle dimension is SEQ_LEN - 1)
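To check these shapes end to end, here is a minimal PyTorch sketch that walks through the table row by row. The concrete values for BATCH_SIZE, SEQ_LEN, VOCAB_SIZE, and EMB_DIM are illustrative assumptions, as is the choice of sinusoidal (non-learnable) position embeddings.

```python
# A minimal sketch reproducing the shapes from the table above.
# The small values below are illustrative, not tied to any particular model.
import math
import torch
import torch.nn as nn

BATCH_SIZE, SEQ_LEN, VOCAB_SIZE, EMB_DIM = 4, 16, 1000, 32

# Data chunk from the dataloader: [BATCH_SIZE, SEQ_LEN]
chunk = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LEN))

# Shift by one to build input-target pairs: both [BATCH_SIZE, SEQ_LEN - 1]
inputs, targets = chunk[:, :-1], chunk[:, 1:]

# Token embedding table: [VOCAB_SIZE, EMB_DIM], learnable
token_emb = nn.Embedding(VOCAB_SIZE, EMB_DIM)

# Sinusoidal position embeddings: [SEQ_LEN, EMB_DIM], not learnable
pos = torch.arange(SEQ_LEN).unsqueeze(1)                                      # [SEQ_LEN, 1]
div = torch.exp(torch.arange(0, EMB_DIM, 2) * (-math.log(10000.0) / EMB_DIM)) # [EMB_DIM / 2]
pos_emb = torch.zeros(SEQ_LEN, EMB_DIM)
pos_emb[:, 0::2] = torch.sin(pos * div)
pos_emb[:, 1::2] = torch.cos(pos * div)

# Input embeddings: token vectors + position vectors -> [BATCH_SIZE, SEQ_LEN - 1, EMB_DIM]
input_embeddings = token_emb(inputs) + pos_emb[: inputs.size(1)]

assert chunk.shape == (BATCH_SIZE, SEQ_LEN)
assert inputs.shape == targets.shape == (BATCH_SIZE, SEQ_LEN - 1)
assert token_emb.weight.shape == (VOCAB_SIZE, EMB_DIM)
assert pos_emb.shape == (SEQ_LEN, EMB_DIM)
assert input_embeddings.shape == (BATCH_SIZE, SEQ_LEN - 1, EMB_DIM)
```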
That's all regarding token sequences and embeddings.