Kseniya Parkhamchuk

Shape of everything inside a transformer

While learning about the GPT-2 architecture, I constantly ran into the same issue: shapes, dimensions, transpositions, extra batch dimensions to keep track of, and every problem that follows from them.

The purpose of this article is to go through the whole GPT model and think carefully about every dimension and its purpose. To keep things organised, let's go step by step in a logical order.

A couple of assumptions:

  1. The information provided applies to a basic nanoGPT-like transformer architecture
  2. The following articles focus on training mode

Model forward pass (a shape-level sketch follows this list):

  1. Token sequences
  2. Embeddings
  3. Attention layer
  4. MLP layer
  5. Final output
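
Before diving into each stage, here is a minimal sketch of how tensor shapes flow through a nanoGPT-like forward pass. The dimension names and values (B, T, C, V) are my own illustrative choices, not fixed by the article; the attention and MLP modules are stand-ins for the real blocks, used only to show that the shape stays `(B, T, C)` until the final projection.

```python
import torch
import torch.nn as nn

B, T, V, C = 2, 8, 50257, 768  # batch, sequence length, vocab size, embedding dim (GPT-2 small uses 768)

tokens = torch.randint(0, V, (B, T))             # 1. token ids: (B, T)

wte = nn.Embedding(V, C)                         # token embedding table: (V, C)
wpe = nn.Embedding(T, C)                         # positional embedding table: (T, C)
x = wte(tokens) + wpe(torch.arange(T))           # 2. embeddings: (B, T, C)

attn = nn.MultiheadAttention(C, num_heads=12, batch_first=True)
x, _ = attn(x, x, x)                             # 3. attention keeps the shape: (B, T, C)

mlp = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))
x = mlp(x)                                       # 4. MLP keeps the shape: (B, T, C)

lm_head = nn.Linear(C, V, bias=False)
logits = lm_head(x)                              # 5. final output: (B, T, V)

print(tokens.shape, x.shape, logits.shape)
```

The sections below walk through each of these stages and explain where every dimension comes from.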

I will add some final notes at the end.
