Kseniya Parkhamchuk

Shape of everything inside a transformer

While learning about the GPT-2 architecture, I constantly ran into the same issue: shapes, dimensions, transpositions, extra batch dimensions to keep track of, and every problem that follows from them.

The purpose of this article is to go through the whole GPT model and think carefully about every dimension and its purpose. To keep things organised, let's go step by step in a logical order.

A couple of assumptions:

  1. The information provided applies to a basic nanoGPT-like transformer architecture
  2. The following articles focus on training mode

Model forward pass (a shape-level sketch follows this list):

  1. Token sequences
  2. Embeddings
  3. Attention layer
  4. MLP layer
  5. Final output
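
Before diving into each stage, here is a minimal sketch of how tensor shapes flow through a nanoGPT-like forward pass. The dimension names and values (B, T, C, V) are my own illustrative choices, not fixed by the article; the attention and MLP modules are stand-ins for the real blocks, used only to show that the shape stays `(B, T, C)` until the final projection.

```python
import torch
import torch.nn as nn

B, T, V, C = 2, 8, 50257, 768  # batch, sequence length, vocab size, embedding dim (GPT-2 small uses 768)

tokens = torch.randint(0, V, (B, T))             # 1. token ids: (B, T)

wte = nn.Embedding(V, C)                         # token embedding table: (V, C)
wpe = nn.Embedding(T, C)                         # positional embedding table: (T, C)
x = wte(tokens) + wpe(torch.arange(T))           # 2. embeddings: (B, T, C)

attn = nn.MultiheadAttention(C, num_heads=12, batch_first=True)
x, _ = attn(x, x, x)                             # 3. attention keeps the shape: (B, T, C)

mlp = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))
x = mlp(x)                                       # 4. MLP keeps the shape: (B, T, C)

lm_head = nn.Linear(C, V, bias=False)
logits = lm_head(x)                              # 5. final output: (B, T, V)

print(tokens.shape, x.shape, logits.shape)
```

The sections below walk through each of these stages and explain where every dimension comes from.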

I will add some final notes at the end.
