Kseniya Parkhamchuk
Back to "Shape of everything inside a transformer"

MLP layer and model outputs

MLP layer dimension values are partially covered in the attention block description (normalization layer, residual connections). I will duplicate them here for the whole picture.

| Value | Dimensionality | Reasoning |
| --- | --- | --- |
| Normalization layer output | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | The dimensionality does not change. The purpose of the layer is to normalize the embeddings. |
| β, γ (normalization layer parameters) | [EMB_DIM] | 1D tensors of the embedding dimensionality, so they can be multiplied with and added to each token embedding element-wise. Each position of the vector is scaled by the corresponding position of the γ vector. |
| Mean and variance inside the normalization layer | [BATCH_SIZE, SEQ_LEN, 1] | A separate mean and variance is computed for each token embedding; the same mean and variance are then applied to every dimension of that token. |
| Weight matrix | [EMB_DIM, N * EMB_DIM] | The weight matrix expands the hidden dimension N times. More dimensions allow more patterns to be formed and boost the model's capacity. |
| Linear layer output (x * weight matrix) | [BATCH_SIZE, SEQ_LEN, N * EMB_DIM] | The linear layer enlarges the embedding dimensionality N times (N is the expansion factor), making the hidden layer wider. |
| MLP projection matrix | [N * EMB_DIM, EMB_DIM] | The first dimension is the N-times-expanded dimension, matching the linear layer output so it can be projected back to the initial shape. |
| MLP projection output | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | Projects the dimensionality back to EMB_DIM. |
| Residual connection | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | MLP output + MLP input. |
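
To make the flow concrete, here is a minimal PyTorch sketch of the shapes from the table above. The constants and variable names are illustrative, not taken from any specific codebase; also note that PyTorch's nn.Linear stores its weight as [out_features, in_features], i.e. transposed relative to the [EMB_DIM, N * EMB_DIM] math convention used in the table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BATCH_SIZE, SEQ_LEN, EMB_DIM, N = 2, 8, 768, 4   # GPT-2 small: EMB_DIM = 768, expansion factor N = 4

x = torch.randn(BATCH_SIZE, SEQ_LEN, EMB_DIM)    # MLP block input

# Normalization layer: per-token statistics, gamma/beta shared across all tokens
mean = x.mean(dim=-1, keepdim=True)                # [BATCH_SIZE, SEQ_LEN, 1]
var = x.var(dim=-1, keepdim=True, unbiased=False)  # [BATCH_SIZE, SEQ_LEN, 1]
ln = nn.LayerNorm(EMB_DIM)
print(ln.weight.shape, ln.bias.shape)            # gamma, beta: [EMB_DIM] each
x_norm = ln(x)                                   # [BATCH_SIZE, SEQ_LEN, EMB_DIM]

# MLP: expand the hidden dimension N times, apply a non-linearity, project back
fc_in = nn.Linear(EMB_DIM, N * EMB_DIM)          # maps EMB_DIM -> N * EMB_DIM
fc_out = nn.Linear(N * EMB_DIM, EMB_DIM)         # maps N * EMB_DIM -> EMB_DIM

hidden = F.gelu(fc_in(x_norm))                   # [BATCH_SIZE, SEQ_LEN, N * EMB_DIM]
mlp_out = fc_out(hidden)                         # [BATCH_SIZE, SEQ_LEN, EMB_DIM]

y = x + mlp_out                                  # residual connection: [BATCH_SIZE, SEQ_LEN, EMB_DIM]
print(x_norm.shape, hidden.shape, mlp_out.shape, y.shape)
```

Printing the shapes at each step confirms that only the hidden dimension grows, and everything returns to EMB_DIM before the residual addition.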

Model output

The inputs and outputs of the normalization layer were already covered in the attention and MLP block descriptions above.

| Value | Dimensionality | Reasoning |
| --- | --- | --- |
| Output projection (linear layer) | [EMB_DIM, VOCAB_SIZE] | The final output of the model should be a probability distribution over the entire vocabulary for each position. This matrix maps the EMB_DIM at each position to logits over the vocabulary. |
| Logits | [BATCH_SIZE, SEQ_LEN, VOCAB_SIZE] | Raw scores over the vocabulary for each token position, converted into a probability distribution using softmax. |
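
A short sketch of this final step, again assuming PyTorch and illustrative constants (50257 is the GPT-2 vocabulary size):

```python
import torch
import torch.nn as nn

BATCH_SIZE, SEQ_LEN, EMB_DIM, VOCAB_SIZE = 2, 8, 768, 50257

hidden_states = torch.randn(BATCH_SIZE, SEQ_LEN, EMB_DIM)  # output of the final normalization layer

lm_head = nn.Linear(EMB_DIM, VOCAB_SIZE, bias=False)       # output projection
logits = lm_head(hidden_states)                            # [BATCH_SIZE, SEQ_LEN, VOCAB_SIZE]
probs = torch.softmax(logits, dim=-1)                      # probability distribution over the vocabulary

print(logits.shape, probs.shape, probs[0, 0].sum())        # probabilities at each position sum to 1
```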

And that is all about shapes inside LLMs like GPT-2. A few notes I want to add:

  1. The shapes of gradients match those of the parameters (see the sketch after this list). For example:

Token embedding parameters = [VOCAB_SIZE, EMB_DIM]
Gradients = [VOCAB_SIZE, EMB_DIM]

W_q, W_k, W_v = [EMB_DIM, HEAD_DIM]
Gradients = [EMB_DIM, HEAD_DIM]

... and so on

  2. Loss is a scalar value.

  3. Different architectural modifications are possible in SOTA models.
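
The first two points can be checked in a few lines. This is a toy example (assuming PyTorch), with shapes chosen for illustration rather than taken from a full model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, EMB_DIM, BATCH_SIZE, SEQ_LEN = 1000, 64, 2, 8

embedding = nn.Embedding(VOCAB_SIZE, EMB_DIM)               # parameters: [VOCAB_SIZE, EMB_DIM]
lm_head = nn.Linear(EMB_DIM, VOCAB_SIZE, bias=False)

tokens = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LEN))
targets = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LEN))

logits = lm_head(embedding(tokens))                         # [BATCH_SIZE, SEQ_LEN, VOCAB_SIZE]
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
print(loss.shape)                                           # torch.Size([]) -- the loss is a scalar

loss.backward()
print(embedding.weight.grad.shape)                          # [VOCAB_SIZE, EMB_DIM], same as the parameter
print(lm_head.weight.grad.shape)                            # matches lm_head.weight as well
```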

Diving deeper into the dimensionality flow in LLMs helps to trace how information is transformed at each stage. These insights build a stronger intuition about architectural choices and future improvements.