Kseniya Parkhamchuk

Tokenizer types

As with many other things, there is no single, universally best tokenizer that every model can use. Different tokenization techniques exist, and they suit different kinds of tasks.

There are 3 types of tokenizers:

  • character
  • subword
  • word

Let's have a look at the most common examples:

Tokenizer types and their characteristics:

Character: splits text into individual characters.

Subword:
  • Byte-Pair Encoding (BPE): an algorithm that iteratively merges the most frequent pairs of characters or character sequences.
  • Unigram Language Model: an algorithm based on a probabilistic language model. Process:
    1. Define an initial (large) vocabulary of subword candidates.
    2. Use an EM (Expectation-Maximization) algorithm to optimize subword probabilities.
    3. Prune the vocabulary by removing low-probability subwords.
  • WordPiece: splits words into subwords; merges token pairs that maximize the likelihood of the training data.
  • Tiktoken: OpenAI's tokenizer for GPT models (BPE with optimizations). Implemented in Rust. Uses regex-based merging rules.
  • SentencePiece: a tokenization framework that treats text as a raw stream (without relying on spaces and punctuation for pre-splitting) and applies BPE or Unigram to segment it.

Word: splits text into words.

Before diving deeper into each one, I would like to add a little bit of context for better understanding later on. Each tokenizer has its own vocabulary, which contains all the available special tokens, characters, subwords, and words (depending on the tokenizer type) used to encode input sequences.

Let's have a closer look at the vocabulary of a Hermes model.

Hermes model vocabulary


Ok, fine, but where do these vocabularies come from?

In reality, it's not enough just to implement a tokenizer. You should also train it to encode your text in the best possible way. During training, you prepare a bunch of data (usually general datasets like the Pile or OpenWebText, but for more advanced use cases it might be domain-specific, like PubMed data) and feed it to the tokenizer, which segments the text according to the algorithm it uses and saves its findings. That might mean saving all unique characters and words, finding the most frequent pairs (BPE), removing tokens based on the probability of their usage, etc.

Whichever algorithm is used, the expected result is a vocabulary.
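To make this concrete, here is a minimal, purely illustrative sketch (not a real training pipeline) that builds a character-level starting vocabulary from a toy corpus and performs a single BPE-style merge of the most frequent pair:

```python
from collections import Counter

# Toy corpus standing in for a real training dataset (the Pile, OpenWebText, ...)
corpus = ["low", "lower", "lowest", "newer", "newest"]

# Step 1: start the vocabulary from all unique characters in the corpus
vocab = sorted({ch for word in corpus for ch in word})

# Step 2 (a single BPE-style step): count adjacent symbol pairs
# and add the most frequent one to the vocabulary as a new token
pair_counts = Counter()
for word in corpus:
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += 1

best_pair = max(pair_counts, key=pair_counts.get)  # ('w', 'e') for this corpus
vocab.append("".join(best_pair))

print(vocab)  # the saved "findings" become the vocabulary
```

Real tokenizer training repeats this kind of step thousands of times over much more data; the chapter on BPE goes into detail.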

Let's move back to tokenizer types and have a look at what the tokenization process looks like.

Character level tokenization

String sequence -> splitting by characters -> mapping with vocabulary -> tokens

"Hello! How are you?"

["H", "e", "l", "l", "o", "!","_", "H", "o", "w", "_","a", "r", "e", "_", "y", "o", "u", "?"]

["H" -> 15, "e" -> 5, "l" -> 10, "l" -> 10, "o" -> 20, "!" -> 51,"" -> 101, "H" -> 15, "o" -> 20, "w" -> 24, "" -> 101,"a" -> 1, "r" -> 18, "e" -> 5, "_" -> 101, "y" -> 56, "o" -> 20, "u" -> 45, "?" -> 155]

[15, 5, 10, 10, 20, 51, 101, 15, 20, 24, 101, 1, 18, 5, 101, 56, 20, 45, 155]

P.S. The numerical representation does not correspond to any real-world tokenizer; this is just an example.
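Here is a minimal sketch of the process above (the vocabulary and IDs are invented, exactly as in the example):

```python
# Toy character-level tokenizer; the vocabulary below is made up for illustration.
vocab = {"H": 15, "e": 5, "l": 10, "o": 20, "!": 51, "_": 101,
         "w": 24, "a": 1, "r": 18, "y": 56, "u": 45, "?": 155}

def encode(text: str) -> list[int]:
    # Spaces are represented as "_", matching the example above
    return [vocab[ch if ch != " " else "_"] for ch in text]

print(encode("Hello! How are you?"))
# [15, 5, 10, 10, 20, 51, 101, 15, 20, 24, 101, 1, 18, 5, 101, 56, 20, 45, 155]
```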

Advantages

  1. Small vocabulary
  2. Easy to implement
  3. Can handle any word
  4. No vocabulary training required
  5. Gives full control over individual characters

Disadvantages

  1. Very difficult for models to capture semantics from individual characters
  2. Long output sequences result in more computation

Use cases:

Usually, such tokenizers might be good for specific tasks:

  • Noisy text ("heyyyy", OCR errors)
  • Code processing
Use case 1: Text anomalies
Task: detecting text anomalies in multilingual user input.
Reason: subword or word tokenizers will struggle with this task. One of the models' specifics is that they cannot see individual characters; they see tokens, which might contain several characters or even a whole word. That's why analyzing text character by character is a much more difficult task than it may sound.
Examples:
  1. "Heyyy im gr8"
  2. "나는 좋아해요 coding in Python"
  3. "h3llo w0rld"

Use case 2: Real-time spell correction for multilingual customer support
Task: build a customer support chatbot for a global e-commerce platform, where user queries come in different languages, often with typos, slang, and mixed scripts.
Reason: character-level analysis.
Example: "I ned help with mi ordar". The bot should be able to correct these misspellings in real time to understand and respond.

Use case 3: Analyzing and generating synthetic DNA sequences
Task: detecting anomalies or generating synthetic DNA. DNA is a string of characters (A, T, C, G), and you need a system to identify mutations.
Reason: character-level analysis allows you to (see the sketch right after this table):
  1. Spot individual point mutations (e.g., ATCG → ATXG, where X is an anomaly)
  2. Generate biologically plausible new sequences one nucleotide at a time.

input -> tokenization -> output
ATCGTACG -> ['A', 'T', 'C', 'G', 'T', 'A', 'C', 'G'] -> Normal
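A tiny, purely illustrative sketch of the DNA use case: with character-level tokens, flagging anything that is not a valid nucleotide becomes trivial.

```python
VALID_NUCLEOTIDES = {"A", "T", "C", "G"}

def check_sequence(dna: str) -> str:
    # Character-level tokenization: every nucleotide is its own token
    tokens = list(dna)
    anomalies = [(i, ch) for i, ch in enumerate(tokens) if ch not in VALID_NUCLEOTIDES]
    return "Normal" if not anomalies else f"Anomaly at {anomalies}"

print(check_sequence("ATCGTACG"))  # Normal
print(check_sequence("ATXGTACG"))  # Anomaly at [(2, 'X')]
```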

Word level tokenization

Let's start with an example:

"Hello! My name is Kseniya. My phone number is +123456789"

["Hello", "!", "My", "name", "is", "Kseniya", ".", "My", "phone", "number", "is", "+123456789"]

[11351, 145, 1000, 5341, 511, <UNK>, 2135, 1000, 15111, 193, 511, <UNK>]

Since "Kseniya" and "+123456789" are likely to be treated as OOV (out-of-vocabulary) words, they are replaced with the <UNK> token.
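A minimal sketch of this behaviour (the vocabulary and IDs are invented, as above):

```python
import re

# Toy word-level vocabulary; anything not listed falls back to <UNK> (id 0 here)
vocab = {"Hello": 11351, "!": 145, "My": 1000, "name": 5341, "is": 511,
         ".": 2135, "phone": 15111, "number": 193, "<UNK>": 0}

def encode(text: str) -> list[int]:
    # Naive word/punctuation split; real word tokenizers use more elaborate rules.
    # Note that this regex splits "+123456789" into "+" and "123456789",
    # both of which are OOV and therefore map to <UNK> anyway.
    words = re.findall(r"\w+|[^\w\s]+", text)
    return [vocab.get(word, vocab["<UNK>"]) for word in words]

print(encode("Hello! My name is Kseniya. My phone number is +123456789"))
# [11351, 145, 1000, 5341, 511, 0, 2135, 1000, 15111, 193, 511, 0, 0]
```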

Downsides:

• Big vocabulary size (it has to store many words).
• Poor handling of OOV words.
• The vocabulary needs more storage because of its size.
• Cannot be used for languages without spaces between words.
• Processing punctuation is hard (e.g. "U.S.A.", "don't").

Advantages:

• Simplicity and intuitiveness
• Results in shorter token sequences

Use cases:

Use case 1: Sentiment analysis
Task: determine the emotional tone of the text, such as whether a product review is positive, negative, or neutral.
Reason:
  • Sentiment often hinges on specific words like "great", "terrible", "love", "hate", which carry emotional weight. A word-level tokenizer keeps them as single tokens.
  • Computationally efficient.
Example: "I love this product" - the word "love" is a strong positive indicator.

Use case 2: Named Entity Recognition (NER) in English
Task: identify and classify named entities in text, such as people ("John Smith"), organisations ("Google"), and so on.
Reason:
  1. Preserves the word as a meaningful unit.
  2. Straightforward mapping to entity labels ("Apple" - organisation, "New York" - location).
Example: extracting key information for the business:
  • identify companies, people, locations, news, reports
  • stay informed about market trends or track competitors

Sub-word tokenization

Sub-word tokenization is the most popular and, in the majority of cases, the most beneficial way to tokenize an input sequence. Let's see how it works:

"In subword tokenization, rare words like ‘unhappiness’ are split into smaller parts such as ‘un’, ‘happi’, and ‘ness’ to improve language model understanding."

Mapped to token IDs, the sentence becomes something like:

[1, 818, 850, 4775, 11241, 1634, 11, 4071, 2456, 588, 564, 246, 403, 71, 42661, 447, 247, 389, 6626, 656, 4833, 3354, 884, 355, 564, 246, 403, 447, 247, 11, 564, 246, 71, 1324, 72, 447, 247, 11, 290, 564, 246, 1108, 447, 247, 284, 2987, 3303, 2746, 4547, 526]
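The IDs above look like output from a GPT-2-style BPE tokenizer. As a rough sketch, you can produce this kind of subword encoding with the tiktoken library; the exact splits and IDs depend on the encoding you choose:

```python
import tiktoken  # pip install tiktoken

# GPT-2 BPE encoding; other encodings (e.g. "cl100k_base") produce different splits and IDs
enc = tiktoken.get_encoding("gpt2")

text = ("In subword tokenization, rare words like 'unhappiness' are split into "
        "smaller parts such as 'un', 'happi', and 'ness' to improve language "
        "model understanding.")

ids = enc.encode(text)
tokens = [enc.decode([i]) for i in ids]

print(ids)     # token IDs
print(tokens)  # rare words such as 'unhappiness' end up as several subword pieces
```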

As it is one of the most popular approaches, it comes with some particular advantages:

1. Allows OOV words to be handled easily
2. Doesn't need a very large vocabulary in comparison to word-level tokenization
3. The tokenized sequence is of a reasonable size

Downsides:

1. More complex implementation
2. Tokens are often morphologically meaningless

Use cases:

Use case 1: Build a translation system that handles multiple European languages (English, German, Spanish).
Reason:
  1. The languages share many morphological patterns, which enables a single shared vocabulary.
  2. For languages like German that include compound words, such as "Orangensaft" (orange juice), the word can be broken into meaningful parts.

Use case 2: Scientific/medical text processing: a biomedical NLP system analyzing research papers and clinical notes.
Reason:
  1. Many specialized terms share suffixes and prefixes ("cardio-", "-ology", "-itis").
  2. A lot of rare or new words and names can be handled.
  3. Easier to build patterns: e.g., "myocarditis" is likely related to "cardiology" because of shared subword units.

That's it for tokenizer types. The next chapter contains more details and insights about a particular algorithm of sub-word tokenization – Byte-Pair Encoding (BPE).