A step-by-step visual explanation.
Everything begins with a simple string of text, like a question or a sentence.
"What is the best book?"
The text is broken down into smaller pieces called "tokens". These can be words or sub-words.
Each token is then mapped to a unique integer ID drawn from a predefined vocabulary.
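A minimal sketch of these two steps in Python, using a hypothetical toy vocabulary and naive whitespace splitting; real models use learned sub-word tokenizers (such as BPE) with vocabularies of tens of thousands of entries:

```python
# Toy illustration of tokenization and ID lookup.
# The vocabulary and the splitting rule below are assumptions for this example only.
vocab = {"what": 0, "is": 1, "the": 2, "best": 3, "book": 4, "?": 5}

def tokenize(text: str) -> list[str]:
    # Naive split: lowercase words, with the question mark as its own token.
    return text.lower().replace("?", " ?").split()

tokens = tokenize("What is the best book?")
token_ids = [vocab[t] for t in tokens]

print(tokens)     # ['what', 'is', 'the', 'best', 'book', '?']
print(token_ids)  # [0, 1, 2, 3, 4, 5]
```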
Each token ID is converted into a vector (a list of numbers), called a Token Embedding. This vector captures the token's meaning. Separately, a Positional Embedding is created for each position (0, 1, 2, ...) to give the model a sense of word order.
Finally, the Token Embedding and the Positional Embedding are added together element-wise for each token to produce the final embedding that is fed into the Transformer model.
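A minimal sketch of this embedding step, assuming randomly initialized tables in place of the learned ones; in a real model both tables are trained parameters (or the positional part may be a fixed sinusoidal encoding):

```python
import numpy as np

vocab_size, max_len, d_model = 6, 16, 8    # toy sizes; real models use e.g. d_model = 768 or more

rng = np.random.default_rng(0)
token_embedding = rng.normal(size=(vocab_size, d_model))   # one row per token ID
position_embedding = rng.normal(size=(max_len, d_model))   # one row per position

token_ids = [0, 1, 2, 3, 4, 5]             # from the tokenization step above
positions = np.arange(len(token_ids))      # 0, 1, 2, ...

# Look up each token's vector and each position's vector, then add them element-wise.
tok_vecs = token_embedding[token_ids]      # shape: (seq_len, d_model)
pos_vecs = position_embedding[positions]   # shape: (seq_len, d_model)
final_embeddings = tok_vecs + pos_vecs     # input to the first Transformer layer

print(final_embeddings.shape)              # (6, 8)
```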
This final vector, which contains information about both the meaning and position of the token, is ready for the Transformer's attention mechanism.