While learning about the GPT-2 architecture, I constantly run into the same issue: shapes, dimensions, transpositions, extra batch dimensions to keep track of, and every related problem that follows from them.
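To make the problem concrete, here is a minimal sketch of the kind of shape juggling I mean. The dimension names and sizes (`batch_size`, `seq_len`, `d_model`, `n_heads`) are just illustrative assumptions, not tied to any particular implementation:

```python
import torch

# Illustrative sizes only (assumed, not from any specific GPT-2 config discussed here).
batch_size, seq_len, d_model, n_heads = 2, 5, 768, 12
head_dim = d_model // n_heads  # 64

x = torch.randn(batch_size, seq_len, d_model)        # (B, T, C)

# Splitting into heads needs a view *and* a transpose,
# which is exactly where the confusion tends to start:
q = x.view(batch_size, seq_len, n_heads, head_dim)   # (B, T, nh, hd)
q = q.transpose(1, 2)                                 # (B, nh, T, hd)

# Attention scores: the last two dims of the matmul have to line up.
scores = q @ q.transpose(-2, -1)                      # (B, nh, T, T)

print(x.shape, q.shape, scores.shape)
```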
The purpose of this article is to go through the whole GPT model and think carefully about every dimension and its purpose. To add some organisation to these writings, let's go through them one by one in a logical order.
A couple of assumptions:
Model forward pass:
And I will add some final notes at the end.