HEAD_DIM – the dimensionality of a single attention head (EMB_DIM / N_HEADS)
N_HEADS – the number of heads in the attention layer, an architecture hyperparameter
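For concreteness, here is a minimal sketch of how these quantities relate; the numbers (EMB_DIM = 768, N_HEADS = 12, roughly GPT-2 small) are illustrative assumptions, not values fixed by this article:

```python
EMB_DIM = 768                   # assumed embedding dimensionality (illustrative)
N_HEADS = 12                    # assumed number of attention heads (illustrative)
HEAD_DIM = EMB_DIM // N_HEADS   # 768 // 12 = 64, the dimensionality of one head

# EMB_DIM must be divisible by N_HEADS so the head outputs can be concatenated back
assert HEAD_DIM * N_HEADS == EMB_DIM
```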
Below is a table with the dimensions. The values hold for training mode; a code sketch follows the table.
Value | Dimensionality | Reasoning | Example |
---|---|---|---|
Normalisation layer output | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | The dimensionality does not change: the only purpose of the layer is to normalise the embeddings | - |
β, γ - normalisation layer parameters | [EMB_DIM] | 1D tensors of size EMB_DIM, so they can be multiplied with and added to the embeddings element-wise | - |
Mean and variance inside normalisation layer | [BATCH_SIZE, SEQ_LEN, 1] | A separate mean and variance is computed for each token embedding; the same pair is then applied to every dimension of that token | - |
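The normalisation-layer shapes above can be checked with a short PyTorch sketch; the batch, sequence, and embedding sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

BATCH_SIZE, SEQ_LEN, EMB_DIM = 2, 10, 768       # illustrative sizes (assumption)

x = torch.randn(BATCH_SIZE, SEQ_LEN, EMB_DIM)   # a batch of token embeddings
ln = nn.LayerNorm(EMB_DIM)

out = ln(x)
print(out.shape)        # torch.Size([2, 10, 768]) -> [BATCH_SIZE, SEQ_LEN, EMB_DIM]
print(ln.weight.shape)  # torch.Size([768])        -> gamma, [EMB_DIM]
print(ln.bias.shape)    # torch.Size([768])        -> beta,  [EMB_DIM]

# One mean and variance per token, shared across all dimensions of that token
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
print(mean.shape, var.shape)  # torch.Size([2, 10, 1]) each -> [BATCH_SIZE, SEQ_LEN, 1]
```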
The dimensions inside the attention mechanism are given for a single attention head; a code sketch follows the table.
Value | Dimensionality | Reasoning | Example |
---|---|---|---|
W_q, W_k, W_v (query, key, value projection matrices) - learnt parameters | [EMB_DIM, HEAD_DIM] | Projecting with the W matrices reduces the initial dimensionality (EMB_DIM) to HEAD_DIM; the [EMB_DIM, HEAD_DIM] shape makes this possible | - |
Q, K, V matrices | [BATCH_SIZE, SEQ_LEN, HEAD_DIM] | The output of projecting with W_q, W_k, W_v. The embedding dimensionality is reduced so that each attention head can focus on separate features | - |
Attention scores (Q x K transposed) | [BATCH_SIZE, SEQ_LEN, SEQ_LEN] | Q x K_T (K is transposed to satisfy matrix multiplication rules). The scores indicate how relevant the tokens are to each other | - |
Attention head output (attn. weights x V) | [BATCH_SIZE, SEQ_LEN, HEAD_DIM] | attn_weights x V = [BATCH_SIZE, SEQ_LEN, SEQ_LEN] x [BATCH_SIZE, SEQ_LEN, HEAD_DIM]. By matrix multiplication rules, the inner SEQ_LEN dimension is contracted and the last dimension becomes HEAD_DIM | - |
Attention layer concatenation output (concatenation of all heads) | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | All head outputs are concatenated along the last dimension, restoring the original embedding dimensionality | - |
Final projection matrix | [EMB_DIM, EMB_DIM] | Preserves the dimensionality of the attention output and mixes the features learnt by the different heads | - |
Bias of the final projection | [EMB_DIM] | Must match the projection output dimensionality; it is added at every position of the projected output | - |
Attention layer output | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | The concatenated head output multiplied by the final projection matrix; by matrix multiplication rules the shape stays [BATCH_SIZE, SEQ_LEN, EMB_DIM] | - |
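The attention shapes can be traced with a minimal single-head sketch; the sizes are the same illustrative assumptions as above, the usual 1/sqrt(HEAD_DIM) scaling and softmax are included (they do not change any shapes), and note that `nn.Linear` stores its weight transposed ([HEAD_DIM, EMB_DIM]) relative to the [EMB_DIM, HEAD_DIM] convention used in the table:

```python
import math
import torch
import torch.nn as nn

BATCH_SIZE, SEQ_LEN = 2, 10            # illustrative sizes (assumption)
EMB_DIM, N_HEADS = 768, 12
HEAD_DIM = EMB_DIM // N_HEADS          # 64

x = torch.randn(BATCH_SIZE, SEQ_LEN, EMB_DIM)    # normalised token embeddings

# One attention head: projections from EMB_DIM down to HEAD_DIM
W_q = nn.Linear(EMB_DIM, HEAD_DIM, bias=False)   # weight stored as [HEAD_DIM, EMB_DIM]
W_k = nn.Linear(EMB_DIM, HEAD_DIM, bias=False)
W_v = nn.Linear(EMB_DIM, HEAD_DIM, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)                 # each [BATCH_SIZE, SEQ_LEN, HEAD_DIM]

scores = Q @ K.transpose(-2, -1) / math.sqrt(HEAD_DIM)  # [BATCH_SIZE, SEQ_LEN, SEQ_LEN]
weights = scores.softmax(dim=-1)                         # [BATCH_SIZE, SEQ_LEN, SEQ_LEN]
head_out = weights @ V                                   # [BATCH_SIZE, SEQ_LEN, HEAD_DIM]

# Concatenate N_HEADS head outputs along the last dimension (the same head is
# repeated here purely to demonstrate the shape) and apply the final projection
concat = torch.cat([head_out] * N_HEADS, dim=-1)   # [BATCH_SIZE, SEQ_LEN, EMB_DIM]
W_o = nn.Linear(EMB_DIM, EMB_DIM)                  # weight [EMB_DIM, EMB_DIM], bias [EMB_DIM]
out = W_o(concat)                                  # [BATCH_SIZE, SEQ_LEN, EMB_DIM]

print(Q.shape, scores.shape, head_out.shape, concat.shape, out.shape)
```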