Kseniya Parkhamchuk

Normalisation and Attention

HEAD_DIM – the dimensionality of a single attention head (EMB_DIM / N_HEADS)

N_HEADS – the number of heads in the attention layer, chosen as an architectural hyperparameter
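As a quick sanity check on this relation (the concrete numbers below are illustrative assumptions, not taken from the article):

```python
# Illustrative configuration (assumed, not from the article)
EMB_DIM = 768     # embedding dimensionality
N_HEADS = 12      # number of attention heads (hyperparameter)

HEAD_DIM = EMB_DIM // N_HEADS        # 768 // 12 = 64
assert HEAD_DIM * N_HEADS == EMB_DIM, "EMB_DIM must be divisible by N_HEADS"
print(HEAD_DIM)                      # 64
```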

Below is a table with the dimensions. The table assumes training mode.

| Value | Dimensionality | Reasoning | Example |
| --- | --- | --- | --- |
| Normalisation layer output | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | The dimensionality does not change; the purpose of the layer is to normalise the embeddings | example 1 |
| β, γ (normalisation layer parameters) | [EMB_DIM] | 1D tensors of the embedding dimensionality, so they can be multiplied with and added to the embeddings element-wise. Each position of a token vector is scaled by the corresponding position of the γ vector | example 2 |
| Mean and variance inside the normalisation layer | [BATCH_SIZE, SEQ_LEN, 1] | A separate mean and variance is computed for each token embedding, and the same mean and variance is then applied to every dimension of that token | example 3 |
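A minimal PyTorch sketch of these shapes, using illustrative sizes (assumed here, not from the article); it traces the per-token mean and variance and the γ/β scaling described in the table:

```python
import torch

# Illustrative sizes (assumptions, not from the article)
BATCH_SIZE, SEQ_LEN, EMB_DIM = 2, 5, 8

x = torch.randn(BATCH_SIZE, SEQ_LEN, EMB_DIM)   # embeddings entering the norm layer

# Learnt parameters: one scale and one shift per embedding dimension
gamma = torch.ones(EMB_DIM)     # [EMB_DIM]
beta = torch.zeros(EMB_DIM)     # [EMB_DIM]

# Per-token statistics: one mean and one variance per token embedding
mean = x.mean(dim=-1, keepdim=True)                 # [BATCH_SIZE, SEQ_LEN, 1]
var = x.var(dim=-1, keepdim=True, unbiased=False)   # [BATCH_SIZE, SEQ_LEN, 1]

# The same mean/var is broadcast over all EMB_DIM positions of a token,
# then each position is scaled by gamma and shifted by beta
out = (x - mean) / torch.sqrt(var + 1e-5) * gamma + beta

print(mean.shape, var.shape, out.shape)
# torch.Size([2, 5, 1]) torch.Size([2, 5, 1]) torch.Size([2, 5, 8])
```

The printed shapes match the table: the statistics are [BATCH_SIZE, SEQ_LEN, 1], while the output keeps the full [BATCH_SIZE, SEQ_LEN, EMB_DIM] shape.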

Attention block

Dimensions inside the attention mechanism are given for a single attention head.

| Value | Dimensionality | Reasoning | Example |
| --- | --- | --- | --- |
| W_q, W_k, W_v (query, key, value projection matrices), learnt parameters | [EMB_DIM, HEAD_DIM] | Projecting onto these matrices reduces the initial dimensionality (EMB_DIM) to HEAD_DIM; the [EMB_DIM, HEAD_DIM] shape makes this possible | example 4 |
| Q, K, V matrices | [BATCH_SIZE, SEQ_LEN, HEAD_DIM] | The outputs of projecting the embeddings onto W_q, W_k, W_v. The embedding dimensionality is reduced so that each attention head can focus on its own subset of features | example 5 |
| Attention scores (Q × Kᵀ) | [BATCH_SIZE, SEQ_LEN, SEQ_LEN] | Q × Kᵀ (K is transposed to satisfy the matrix multiplication rules). The scores indicate how relevant the tokens are to each other | example 6 |
| Attention head output (attention weights × V) | [BATCH_SIZE, SEQ_LEN, HEAD_DIM] | attn_weights × V = [BATCH_SIZE, SEQ_LEN, SEQ_LEN] × [BATCH_SIZE, SEQ_LEN, HEAD_DIM]. By the matrix multiplication rules, the inner SEQ_LEN dimension is contracted and the last dimension becomes HEAD_DIM | example 7 |
| Attention layer concatenation output (concatenation of all heads) | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | The outputs of all heads are concatenated along the last dimension, restoring the original embedding dimensionality (N_HEADS × HEAD_DIM = EMB_DIM) | example 8 |
| Final projection matrix | [EMB_DIM, EMB_DIM] | Preserves the dimensionality of the attention output while mixing the features learnt by the different heads | final projection matrix |
| Bias of the final projection | [EMB_DIM] | Must match the last dimension of the projection output; it is added at every position of the projected output | |
| Attention layer output | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | attention_output × projection matrix: [BATCH_SIZE, SEQ_LEN, EMB_DIM] × [EMB_DIM, EMB_DIM], following the matrix multiplication rules | example 9 |
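A minimal PyTorch sketch of the attention shapes, using the same illustrative sizes as above (assumed, not from the article) with N_HEADS = 2 and therefore HEAD_DIM = 4. The names W_o and b_o for the final projection are hypothetical labels, and the standard 1/√HEAD_DIM scaling is included even though the table omits it:

```python
import torch

# Illustrative sizes (assumptions, not from the article)
BATCH_SIZE, SEQ_LEN, EMB_DIM, N_HEADS = 2, 5, 8, 2
HEAD_DIM = EMB_DIM // N_HEADS   # 4

x = torch.randn(BATCH_SIZE, SEQ_LEN, EMB_DIM)   # normalised embeddings

head_outputs = []
for _ in range(N_HEADS):
    # Per-head learnt projections: [EMB_DIM, HEAD_DIM]
    W_q = torch.randn(EMB_DIM, HEAD_DIM)
    W_k = torch.randn(EMB_DIM, HEAD_DIM)
    W_v = torch.randn(EMB_DIM, HEAD_DIM)

    Q, K, V = x @ W_q, x @ W_k, x @ W_v   # each [BATCH_SIZE, SEQ_LEN, HEAD_DIM]

    # Scores and weights over token pairs (scaled dot-product)
    scores = Q @ K.transpose(-2, -1) / HEAD_DIM ** 0.5   # [BATCH_SIZE, SEQ_LEN, SEQ_LEN]
    weights = torch.softmax(scores, dim=-1)              # [BATCH_SIZE, SEQ_LEN, SEQ_LEN]
    head_outputs.append(weights @ V)                     # [BATCH_SIZE, SEQ_LEN, HEAD_DIM]

# Concatenate all heads along the last dimension: back to EMB_DIM
concat = torch.cat(head_outputs, dim=-1)    # [BATCH_SIZE, SEQ_LEN, EMB_DIM]

# Final projection (hypothetical names W_o, b_o) mixes features across heads
# and keeps the shape
W_o = torch.randn(EMB_DIM, EMB_DIM)         # [EMB_DIM, EMB_DIM]
b_o = torch.zeros(EMB_DIM)                  # [EMB_DIM]
attn_out = concat @ W_o + b_o               # [BATCH_SIZE, SEQ_LEN, EMB_DIM]

print(scores.shape, concat.shape, attn_out.shape)
# torch.Size([2, 5, 5]) torch.Size([2, 5, 8]) torch.Size([2, 5, 8])
```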