Kseniya Parkhamchuk

Intro

Let's imagine you are traveling in another country and make a short stop for a cup of coffee at a local coffee shop. How are you going to place an order if the cashier speaks only their native language and you don't? Thank God we have translators! You type your order into one, get a translation in the cashier's language, and voilà, communication is established!

We have successfully established person-to-person communication across two languages. What about the next challenge: person-to-model?

What happens when we type our request into the dialog window, asking "What is the best fantasy book of all time?"

Unfortunately, models do not speak any human language yet. But the good news is that we can translate our message into theirs! The only questions are: what language do they speak, and how do we translate?

This is where the tokenizer takes the stage. A tokenizer is literally a translator from human language into the model's. That's exactly what we need! Now let's see what it looks like. Let's ask our question again:

"What is the best fantasy book of all time?"

... Translating...

[4827, 382, 290, 1636, 30988, 2392, 328, 198, 586, 4238, 30]

What happened? Let's break it down.

We put the string (our question) into the tokenizer and got a sequence of tokens instead. Tokens may be defined differently depending on the tokenizer type, but we will talk about that later.

In this particular case, I used Tiktokenizer (a playground for the tokenizer that OpenAI uses for its GPT models) to break the sentence into smaller parts. You can experiment yourself by following this link: Tokenizer playground
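If you prefer to reproduce this in code, here is a minimal sketch using OpenAI's open-source tiktoken library. One caveat: the playground output above may come from a different encoding, so the exact IDs you see can differ.

```python
import tiktoken

# Load one of the encodings used by OpenAI's GPT models.
enc = tiktoken.get_encoding("cl100k_base")

question = "What is the best fantasy book of all time?"

# Encode: string -> list of integer token IDs.
token_ids = enc.encode(question)
print(token_ids)

# Decode: token IDs -> the original string (the round trip is lossless).
print(enc.decode(token_ids))
```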

[Image: the question split into individual tokens in the Tiktokenizer playground]

What → 4827, is → 382, the → 290, etc.

It might seem that the text is simply split word by word, but this is not always the case. Let's have a look at the following example:

The GPT tokenizer has more than 100,000 tokens in its vocabulary, so many common words are represented as single whole-word tokens. Things get tougher when it comes to more sophisticated text or rare words, as we can see in the example.

[Image: a rare word split into several subword tokens]
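You can watch this splitting happen yourself by decoding each token ID back into its text piece. A frequent word is usually a single token, while a rare or made-up one breaks into several. Again a sketch with tiktoken; the exact pieces depend on the encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# A frequent word tends to be a single token;
# a rare word is assembled from several subword pieces.
for word in ["fantasy", "antidisestablishmentarianism"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r} -> {ids} {pieces}")
```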

After that, we can feed our token sequence to the model and wait for a response.
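Conceptually, the whole round trip looks like the sketch below. Note that `model.generate` here is a hypothetical stand-in for a real language model, not an actual API:

```python
def chat(text: str, tokenizer, model) -> str:
    # Human language -> token IDs the model can work with.
    input_ids = tokenizer.encode(text)

    # The model operates purely on numbers and returns new token IDs.
    output_ids = model.generate(input_ids)  # hypothetical interface

    # Token IDs -> human language again.
    return tokenizer.decode(output_ids)
```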

To wrap it up, a tokenizer is a computational tool that transforms human language into a sequence of tokens (discrete numerical identifiers that represent words, parts of words, or characters) that machine learning models can process mathematically. This translation from natural language to numerical representations is essential because models cannot directly understand text; they can only perform calculations on numbers. The tokenizer serves as the crucial interface between human communication and machine computation, allowing our questions, commands, and conversations to be processed inside the model's neural architecture.
