Some of the MLP block dimension values (normalization layer, residual connections) are already covered in the attention block description. I duplicate them here for the whole picture; a shape-annotated code sketch follows the table.
Value | Dimensionality | Reasoning | Example |
---|---|---|---|
Normalization layer output | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | Dimensionality does not change; the purpose of the layer is embedding normalization | ![]() |
β, γ - normalization layer parameters | [EMB_DIM] | 1D tensors of the embedding dimensionality size, so they can scale and shift each embedding element-wise | ![]() |
Mean and variance inside the normalization layer | [BATCH_SIZE, SEQ_LEN, 1] | Separate mean and variance for each token embedding (but the same mean and variance are applied to all dimensions of that token) | ![]() Mean and variance are applied to every dimension of the current token |
Weight matrix | [EMB_DIM, N * EMB_DIM] | The weight matrix has an N-times expanded hidden dimension. More dimensions allow more patterns to be formed and boost the model's capacity | ![]() |
Linear layer output (x * Weight matrix) | [BATCH_SIZE, SEQ_LEN, N * EMB_DIM] | The linear layer enlarges the embedding dimensionality N times, making the hidden representation wider (N is the expansion factor) | ![]() |
MLP projection matrix | [N * EMB_DIM, EMB_DIM] | The first dimension is the N-times expanded dimension to match the linear layer output; the second dimension projects back to the initial embedding size | ![]() |
MLP projection output | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | Projects dimensionality back to EMB_DIM | ![]() |
Residual connections | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | MLP output + MLP input | |
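To make the table concrete, here is a minimal PyTorch sketch of a GPT-2-style pre-norm MLP block. The sizes, the expansion factor N = 4, and the GELU nonlinearity between the two linear layers are illustrative assumptions (GELU does not change any shapes). Note that `nn.Linear` stores its weight as [out_features, in_features], i.e. transposed relative to the table above.

```python
import torch
import torch.nn as nn

BATCH_SIZE, SEQ_LEN, EMB_DIM, N = 2, 8, 768, 4  # illustrative sizes

class MLPBlock(nn.Module):
    def __init__(self, emb_dim: int, expansion: int):
        super().__init__()
        self.norm = nn.LayerNorm(emb_dim)                      # gamma, beta: [EMB_DIM]
        self.fc_in = nn.Linear(emb_dim, expansion * emb_dim)   # weight stored as [N * EMB_DIM, EMB_DIM]
        self.act = nn.GELU()                                   # nonlinearity, shape-preserving
        self.fc_out = nn.Linear(expansion * emb_dim, emb_dim)  # weight stored as [EMB_DIM, N * EMB_DIM]

    def forward(self, x):
        # x: [BATCH_SIZE, SEQ_LEN, EMB_DIM] -- MLP input from the residual stream
        h = self.norm(x)    # [BATCH_SIZE, SEQ_LEN, EMB_DIM]
        h = self.fc_in(h)   # [BATCH_SIZE, SEQ_LEN, N * EMB_DIM]
        h = self.act(h)     # [BATCH_SIZE, SEQ_LEN, N * EMB_DIM]
        h = self.fc_out(h)  # [BATCH_SIZE, SEQ_LEN, EMB_DIM]
        return x + h        # residual connection: MLP output + MLP input

x = torch.randn(BATCH_SIZE, SEQ_LEN, EMB_DIM)
out = MLPBlock(EMB_DIM, N)(x)
print(out.shape)  # torch.Size([2, 8, 768])
```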
The inputs and outputs of the normalization layer have the same shapes as in the attention and MLP blocks described above, so I do not repeat them here; a short code sketch for the output projection and logits follows the table.
Value | Dimensionality | Reasoning | Example |
---|---|---|---|
Output projection (linear layer) | [EMB_DIM, VOCAB_SIZE] | The final output of the model should be a probability distribution over the entire vocabulary of tokens for each position. This matrix maps the EMB_DIM at each position to logits over the vocabulary. | ![]() |
Logits | [BATCH_SIZE, SEQ_LEN, VOCAB_SIZE] | Raw scores over the vocabulary for each token position, to be converted into a probability distribution using softmax | ![]() |
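A minimal sketch of this final step, assuming a GPT-2-sized vocabulary (50257) and illustrative batch/sequence sizes; again, `nn.Linear` stores the projection weight transposed, as [VOCAB_SIZE, EMB_DIM]:

```python
import torch
import torch.nn as nn

BATCH_SIZE, SEQ_LEN, EMB_DIM, VOCAB_SIZE = 2, 8, 768, 50257  # illustrative sizes

# Hidden states coming out of the last transformer block
hidden = torch.randn(BATCH_SIZE, SEQ_LEN, EMB_DIM)

# Output projection: maps EMB_DIM -> VOCAB_SIZE independently at every position
lm_head = nn.Linear(EMB_DIM, VOCAB_SIZE, bias=False)  # weight stored as [VOCAB_SIZE, EMB_DIM]

logits = lm_head(hidden)               # [BATCH_SIZE, SEQ_LEN, VOCAB_SIZE] -- raw scores
probs = torch.softmax(logits, dim=-1)  # probability distribution over the vocabulary

print(logits.shape)       # torch.Size([2, 8, 50257])
print(probs[0, 0].sum())  # ~1.0: probabilities at each position sum to one
```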
And this is all about shapes inside LLMs like GPT-2. A couple of notes I want to add:
Gradients have exactly the same shape as the parameters they belong to:

- Token embedding parameters = [VOCAB_SIZE, EMB_DIM], gradients = [VOCAB_SIZE, EMB_DIM]
- W_q, W_k, W_v = [EMB_DIM, HEAD_DIM], gradients = [EMB_DIM, HEAD_DIM]
- ... and so on
The loss is a scalar value (both points are demonstrated in the sketch after these notes).
Different architectural modifications are possible in SOTA models, so exact shapes and layer arrangements may differ from GPT-2.
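A quick way to check the two notes above is a toy setup with just an embedding layer and an output projection (all sizes illustrative, transformer blocks skipped): the loss comes out as a scalar, and every gradient tensor matches the shape of its parameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BATCH_SIZE, SEQ_LEN, EMB_DIM, VOCAB_SIZE = 2, 8, 768, 50257  # illustrative sizes

embedding = nn.Embedding(VOCAB_SIZE, EMB_DIM)         # parameters: [VOCAB_SIZE, EMB_DIM]
lm_head = nn.Linear(EMB_DIM, VOCAB_SIZE, bias=False)  # parameters: [VOCAB_SIZE, EMB_DIM]

tokens = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LEN))
targets = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LEN))

logits = lm_head(embedding(tokens))  # [BATCH_SIZE, SEQ_LEN, VOCAB_SIZE]
loss = F.cross_entropy(logits.view(-1, VOCAB_SIZE), targets.view(-1))
print(loss.shape)  # torch.Size([]) -- the loss is a scalar

loss.backward()
# Every gradient tensor has exactly the same shape as its parameter
print(embedding.weight.grad.shape)  # torch.Size([50257, 768])
print(lm_head.weight.grad.shape)    # torch.Size([50257, 768])
```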
Diving deeper into the dimensionality flow in LLMs helps to trace how information is transformed at each stage. These insights build stronger intuition about architectural choices and possible future improvements.