Kseniya Parkhamchuk
Back to "Shape of everything inside a transformer"

Tokens and embeddings

Feel free to check the input transformation visualization to refresh your memory of the flow.

I also like to come back to this resource, as it may help you visualize the whole architecture.

Below is a table of all the values whose dimensionality you might have doubts about. It is preceded by a short vocabulary of definitions to prevent misunderstandings.

SEQ_LEN - maximum length of the input sequence (longer sequences are truncated, shorter ones are padded with the <PAD> token). Fixed during training.

BATCH_SIZE - number of sequences of length SEQ_LEN that go through one training step at once.

VOCAB_SIZE - defines how many tokens the model can identify. Think of it as the number of words a dictionary contains and that you can look up (modern models usually have vocabularies of ~30,000-100,000 tokens).

EMB_DIM - embedding dimensionality; a hyperparameter chosen by the model designer to suit the model's size and depth.
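
To make these four values concrete, here is a minimal sketch of how they might be grouped into a single configuration object. The class name ModelConfig and the concrete numbers are illustrative assumptions, not values taken from this article.

from dataclasses import dataclass

@dataclass
class ModelConfig:
    seq_len: int = 128        # SEQ_LEN: fixed maximum sequence length during training
    batch_size: int = 32      # BATCH_SIZE: sequences processed in one training step
    vocab_size: int = 50_000  # VOCAB_SIZE: number of distinct tokens the model knows
    emb_dim: int = 256        # EMB_DIM: length of every token / position vector

config = ModelConfig()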

Below is the table of dimensions; it applies to training mode. Each entry lists the value, its dimensionality, the reasoning behind that shape, and an example. A runnable sketch after the table ties all the shapes together.

Value: Data chunks from the dataloader
Dimensionality: [BATCH_SIZE, SEQ_LEN]
Reasoning: Reducing the batch size results in less memory usage. Sequence length is a hyperparameter that defines how many tokens the model can "remember".
Example:
batch = [
    [1, 2, 3, 4, 5],
    [11, 36, 47, 53, 15],
    [23, 123, 5, 2, 55]
]
BATCH_SIZE = 3
SEQ_LEN = 5
Shape: [3, 5]

Value: Inputs (the token sequence fed to the model during training)
Dimensionality: [BATCH_SIZE, SEQ_LEN - 1]
Reasoning: We shift the sequence by one token to create input-target pairs for next-token prediction.
Example:
inputs = batch[:, :-1]
inputs = [
    [1, 2, 3, 4],
    [11, 36, 47, 53],
    [23, 123, 5, 2]
]
BATCH_SIZE = 3
SEQ_LEN - 1 = 4
Shape: [3, 4]

Value: Targets (the tokens the model should predict)
Dimensionality: [BATCH_SIZE, SEQ_LEN - 1]
Reasoning: The shape is the same as for the inputs. Targets are the same sequence as the inputs, shifted one step ahead so that each position holds the next token the model has to predict.
Example:
targets = batch[:, 1:]
targets = [
    [2, 3, 4, 5],
    [36, 47, 53, 15],
    [123, 5, 2, 55]
]
Shape: [3, 4]

Value: Token embeddings
Dimensionality: [VOCAB_SIZE, EMB_DIM]
Reasoning: Stores one vector of EMB_DIM dimensionality for each token in the vocabulary. Embeddings are learnable parameters.
Example: token embeddings (illustration)

Value: Position embeddings
Dimensionality: [SEQ_LEN, EMB_DIM]
Reasoning: Stores information about each token's position in the sequence. Usually not learnable parameters; calculated with the sinusoidal formula.
Example: position embeddings (illustration)

Value: Input embeddings (the output of the embedding layer)
Dimensionality: [BATCH_SIZE, SEQ_LEN, EMB_DIM]
Reasoning: Adds the embedding dimensionality to the existing [BATCH_SIZE, SEQ_LEN] inputs, turning each token ID into a vector.
Example: input embeddings (illustration)
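
Here is the runnable sketch mentioned above, in PyTorch, tying all of these shapes together. The toy hyperparameter values (VOCAB_SIZE = 200, EMB_DIM = 8) and the tensor contents are illustrative assumptions; only the shapes matter. Note that after the input-target shift the sequence dimension becomes SEQ_LEN - 1, so the input embeddings here come out as [BATCH_SIZE, SEQ_LEN - 1, EMB_DIM].

import math
import torch

BATCH_SIZE, SEQ_LEN, VOCAB_SIZE, EMB_DIM = 3, 5, 200, 8

# Data chunk from the dataloader: [BATCH_SIZE, SEQ_LEN]
batch = torch.tensor([
    [1, 2, 3, 4, 5],
    [11, 36, 47, 53, 15],
    [23, 123, 5, 2, 55],
])

# Shift by one token to build input-target pairs: both [BATCH_SIZE, SEQ_LEN - 1]
inputs = batch[:, :-1]
targets = batch[:, 1:]
assert inputs.shape == targets.shape == (BATCH_SIZE, SEQ_LEN - 1)

# Token embedding table: [VOCAB_SIZE, EMB_DIM], learnable parameters
token_emb = torch.nn.Embedding(VOCAB_SIZE, EMB_DIM)

# Sinusoidal position embeddings: [SEQ_LEN, EMB_DIM], not learnable
position = torch.arange(SEQ_LEN).unsqueeze(1)  # [SEQ_LEN, 1]
div_term = torch.exp(torch.arange(0, EMB_DIM, 2) * (-math.log(10000.0) / EMB_DIM))
pos_emb = torch.zeros(SEQ_LEN, EMB_DIM)
pos_emb[:, 0::2] = torch.sin(position * div_term)
pos_emb[:, 1::2] = torch.cos(position * div_term)

# Input embeddings: token vectors + position vectors
input_emb = token_emb(inputs) + pos_emb[: inputs.shape[1]]
assert input_emb.shape == (BATCH_SIZE, SEQ_LEN - 1, EMB_DIM)
print(input_emb.shape)  # torch.Size([3, 4, 8])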

That's all regarding token sequences and embeddings.