Kseniya Parkhamchuk
Back to "Shape of everything inside a transformer"

MLP layer and model outputs

MLP layer dimension values are partially covered in the attention block description (normalization layer, residual connections). I will duplicate them here for the whole picture.

| Value | Dimensionality | Reasoning |
| --- | --- | --- |
| Normalization layer output | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | The dimensionality does not change. The purpose of the layer is to normalize the embeddings. |
| β, γ (normalization layer parameters) | [EMB_DIM] | 1D tensors of the embedding dimensionality, so they can be multiplied with and added to each token embedding element-wise. Each position of the vector is scaled by the corresponding position of the γ vector. |
| Mean and variance inside the normalization layer | [BATCH_SIZE, SEQ_LEN, 1] | A separate mean and variance is computed for each token embedding; the same mean and variance are then applied to every dimension of that token. |
| Weight matrix | [EMB_DIM, N * EMB_DIM] | The weight matrix expands the hidden dimension N times. More dimensions allow more patterns to be formed and boost the model's capacity. |
| Linear layer output (x * weight matrix) | [BATCH_SIZE, SEQ_LEN, N * EMB_DIM] | The linear layer enlarges the embedding dimensionality N times (N is the expansion factor), making the hidden layer wider. |
| MLP projection matrix | [N * EMB_DIM, EMB_DIM] | The first dimension is the N-times-expanded dimension, matching the linear layer output so it can be projected back to the initial shape. |
| MLP projection output | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | Projects the dimensionality back to EMB_DIM. |
| Residual connection | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | MLP output + MLP input. |
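
To make the flow concrete, here is a minimal PyTorch sketch of the shapes from the table above. The constants and variable names are illustrative, not taken from any specific codebase; also note that PyTorch's nn.Linear stores its weight as [out_features, in_features], i.e. transposed relative to the [EMB_DIM, N * EMB_DIM] math convention used in the table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BATCH_SIZE, SEQ_LEN, EMB_DIM, N = 2, 8, 768, 4   # GPT-2 small: EMB_DIM = 768, expansion factor N = 4

x = torch.randn(BATCH_SIZE, SEQ_LEN, EMB_DIM)    # MLP block input

# Normalization layer: per-token statistics, gamma/beta shared across all tokens
mean = x.mean(dim=-1, keepdim=True)                # [BATCH_SIZE, SEQ_LEN, 1]
var = x.var(dim=-1, keepdim=True, unbiased=False)  # [BATCH_SIZE, SEQ_LEN, 1]
ln = nn.LayerNorm(EMB_DIM)
print(ln.weight.shape, ln.bias.shape)            # gamma, beta: [EMB_DIM] each
x_norm = ln(x)                                   # [BATCH_SIZE, SEQ_LEN, EMB_DIM]

# MLP: expand the hidden dimension N times, apply a non-linearity, project back
fc_in = nn.Linear(EMB_DIM, N * EMB_DIM)          # maps EMB_DIM -> N * EMB_DIM
fc_out = nn.Linear(N * EMB_DIM, EMB_DIM)         # maps N * EMB_DIM -> EMB_DIM

hidden = F.gelu(fc_in(x_norm))                   # [BATCH_SIZE, SEQ_LEN, N * EMB_DIM]
mlp_out = fc_out(hidden)                         # [BATCH_SIZE, SEQ_LEN, EMB_DIM]

y = x + mlp_out                                  # residual connection: [BATCH_SIZE, SEQ_LEN, EMB_DIM]
print(x_norm.shape, hidden.shape, mlp_out.shape, y.shape)
```

Printing the shapes at each step confirms that only the hidden dimension grows, and everything returns to EMB_DIM before the residual addition.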

Model output

The inputs and outputs of the normalization layer were already covered in the attention and MLP block descriptions above.

| Value | Dimensionality | Reasoning |
| --- | --- | --- |
| Output projection (linear layer) | [EMB_DIM, VOCAB_SIZE] | The final output of the model should be a probability distribution over the entire vocabulary for each position. This matrix maps the EMB_DIM at each position to logits over the vocabulary. |
| Logits | [BATCH_SIZE, SEQ_LEN, VOCAB_SIZE] | Raw scores over the vocabulary for each token position, converted into a probability distribution using softmax. |
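
A short sketch of this final step, again assuming PyTorch and illustrative constants (50257 is the GPT-2 vocabulary size):

```python
import torch
import torch.nn as nn

BATCH_SIZE, SEQ_LEN, EMB_DIM, VOCAB_SIZE = 2, 8, 768, 50257

hidden_states = torch.randn(BATCH_SIZE, SEQ_LEN, EMB_DIM)  # output of the final normalization layer

lm_head = nn.Linear(EMB_DIM, VOCAB_SIZE, bias=False)       # output projection
logits = lm_head(hidden_states)                            # [BATCH_SIZE, SEQ_LEN, VOCAB_SIZE]
probs = torch.softmax(logits, dim=-1)                      # probability distribution over the vocabulary

print(logits.shape, probs.shape, probs[0, 0].sum())        # probabilities at each position sum to 1
```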

And that is all about shapes inside LLMs like GPT-2. A few notes I want to add:

  1. The shapes of gradients match those of the parameters (see the sketch after this list). For example:

Token embedding parameters = [VOCAB_SIZE, EMB_DIM]
Gradients = [VOCAB_SIZE, EMB_DIM]

W_q, W_k, W_v = [EMB_DIM, HEAD_DIM]
Gradients = [EMB_DIM, HEAD_DIM]

... and so on

  2. Loss is a scalar value.

  3. Different architectural modifications are possible in SOTA models.
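
The first two points can be checked in a few lines. This is a toy example (assuming PyTorch), with shapes chosen for illustration rather than taken from a full model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, EMB_DIM, BATCH_SIZE, SEQ_LEN = 1000, 64, 2, 8

embedding = nn.Embedding(VOCAB_SIZE, EMB_DIM)               # parameters: [VOCAB_SIZE, EMB_DIM]
lm_head = nn.Linear(EMB_DIM, VOCAB_SIZE, bias=False)

tokens = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LEN))
targets = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LEN))

logits = lm_head(embedding(tokens))                         # [BATCH_SIZE, SEQ_LEN, VOCAB_SIZE]
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
print(loss.shape)                                           # torch.Size([]) -- the loss is a scalar

loss.backward()
print(embedding.weight.grad.shape)                          # [VOCAB_SIZE, EMB_DIM], same as the parameter
print(lm_head.weight.grad.shape)                            # matches lm_head.weight as well
```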

Diving deeper into the dimensionality flow in LLMs helps to trace how information is transformed at each stage. These insights build a stronger intuition about architectural choices and future improvements.