A step-by-step visual explanation.
Everything begins with a simple string of text, like a question or a sentence.
"What is the best book?"
The text is broken down into smaller pieces called "tokens". These can be words or sub-words.
Each token is then mapped to a unique integer ID drawn from a predefined vocabulary.
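A minimal sketch of these two steps in Python, using a hypothetical toy vocabulary and naive whitespace splitting; real models use learned sub-word tokenizers (such as BPE) with vocabularies of tens of thousands of entries:

```python
# Toy illustration of tokenization and ID lookup.
# The vocabulary and the splitting rule below are assumptions for this example only.
vocab = {"what": 0, "is": 1, "the": 2, "best": 3, "book": 4, "?": 5}

def tokenize(text: str) -> list[str]:
    # Naive split: lowercase words, with the question mark as its own token.
    return text.lower().replace("?", " ?").split()

tokens = tokenize("What is the best book?")
token_ids = [vocab[t] for t in tokens]

print(tokens)     # ['what', 'is', 'the', 'best', 'book', '?']
print(token_ids)  # [0, 1, 2, 3, 4, 5]
```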
Each token ID is converted into a vector (a list of numbers), called a Token Embedding. This vector captures the token's meaning. Separately, a Positional Embedding is created for each position (0, 1, 2, ...) to give the model a sense of word order.
Finally, the Token Embedding and the Positional Embedding are added together element-wise for each token to produce the final embedding that is fed into the Transformer model.
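A minimal sketch of this embedding step, assuming randomly initialized tables in place of the learned ones; in a real model both tables are trained parameters (or the positional part may be a fixed sinusoidal encoding):

```python
import numpy as np

vocab_size, max_len, d_model = 6, 16, 8    # toy sizes; real models use e.g. d_model = 768 or more

rng = np.random.default_rng(0)
token_embedding = rng.normal(size=(vocab_size, d_model))   # one row per token ID
position_embedding = rng.normal(size=(max_len, d_model))   # one row per position

token_ids = [0, 1, 2, 3, 4, 5]             # from the tokenization step above
positions = np.arange(len(token_ids))      # 0, 1, 2, ...

# Look up each token's vector and each position's vector, then add them element-wise.
tok_vecs = token_embedding[token_ids]      # shape: (seq_len, d_model)
pos_vecs = position_embedding[positions]   # shape: (seq_len, d_model)
final_embeddings = tok_vecs + pos_vecs     # input to the first Transformer layer

print(final_embeddings.shape)              # (6, 8)
```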
This final vector, which contains information about both the meaning and position of the token, is ready for the Transformer's attention mechanism.