As with many other things, there is no single, universally best tokenizer that every model can use. Different tokenization techniques exist, and they suit different kinds of tasks.
There are three types of tokenizers: character-level, word-level, and subword.
Let's have a look at the most common examples:
Tokenizer | Characteristics |
---|---|
Character | Splits text into individual characters |
Subword | Splits text into units smaller than words but larger than characters; the main algorithms are listed below |
Byte-Pair Encoding (BPE) | Algorithm that iteratively merges the most frequent pairs of characters or character sequences |
Unigram Language Model | Algorithm based on a probabilistic language model: it starts from a large candidate vocabulary and iteratively removes the tokens that contribute least to the likelihood of the training data |
WordPiece | Splits words into subwords; merges token pairs that maximize the likelihood of the training data |
Tiktoken | OpenAI's tokenizer for GPT models (BPE with optimizations). Implemented in Rust, uses regex-based merging rules |
SentencePiece | Tokenization framework that treats text as a raw stream (handling spaces as ordinary symbols rather than relying on whitespace pre-tokenization) and applies BPE or Unigram to segment it |
Word | Splits text into words |
Before diving deeper into each one, I would like to add a little context for better understanding later on. Each tokenizer has its own vocabulary, which contains all the available special tokens, characters, subwords, and words (depending on the tokenizer type) used to encode input sequences.
Let's have a closer look at the vocabulary of a Hermes model.
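If you want to poke around such a vocabulary yourself, here is a minimal sketch using the Hugging Face `transformers` library (the exact model id is an assumption; any Hermes checkpoint on the Hub can be inspected the same way):

```python
from transformers import AutoTokenizer

# model id is an assumption for illustration; swap in the Hermes checkpoint you use
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Hermes-2-Pro-Llama-3-8B")

vocab = tokenizer.get_vocab()            # dict of {token string: id}
print(len(vocab))                        # vocabulary size
print(tokenizer.all_special_tokens)      # special tokens (BOS/EOS, chat markers, etc.)
print(list(vocab.items())[:10])          # a few (token, id) pairs
```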
Ok, fine, but where do these vocabularies come from?
In reality, it's not enough to just implement a tokenizer. You also have to train it to encode your text in the best possible way. During training, you prepare a bunch of data (usually general datasets like the Pile or OpenWebText, though for more advanced use cases it might be domain-specific, such as PubMed data) and feed it to the tokenizer, which segments it according to the algorithm it uses and saves its findings. That might mean storing all unique characters and words, finding the most frequent pairs (BPE), removing tokens based on their usage probability, and so on.
Whichever algorithm is used, the expected result is a vocabulary.
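To make this concrete, here is a minimal sketch of training a BPE tokenizer with the Hugging Face `tokenizers` library (the corpus file name and vocabulary size are placeholders, not taken from the article):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# BPE model with an explicit unknown token
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# split on whitespace/punctuation before the merges are learned
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30_000,                     # placeholder size
    special_tokens=["[UNK]", "[PAD]"],
)

# "corpus.txt" is a placeholder for your training data (general or domain-specific)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# the learned vocabulary and merge rules are the "expected result" mentioned above
tokenizer.save("my_tokenizer.json")
```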
Let's move back to tokenizer types and look at what the tokenization process looks like, starting with character tokenization.
String sequence -> split into characters -> map with the vocabulary -> token IDs
"Hello! How are you?"
↓
["H", "e", "l", "l", "o", "!","_", "H", "o", "w", "_","a", "r", "e", "_", "y", "o", "u", "?"]
↓
["H" -> 15, "e" -> 5, "l" -> 10, "l" -> 10, "o" -> 20, "!" -> 51,"" -> 101, "H" -> 15, "o" -> 20, "w" -> 24, "" -> 101,"a" -> 1, "r" -> 18, "e" -> 5, "_" -> 101, "y" -> 56, "o" -> 20, "u" -> 45, "?" -> 155]
↓
[15, 5, 10, 10, 20, 51, 101, 15, 20, 24, 101, 1, 18, 5, 101, 56, 20, 45, 155]
P.S. The numerical IDs do not correspond to any real-world tokenizer; this is just an example.
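Here is a minimal sketch of this pipeline (the vocabulary is built on the fly from the input purely for illustration; real tokenizers fix their vocabulary during training):

```python
text = "Hello! How are you?"

# toy vocabulary: every distinct character in the input gets an ID
vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}

tokens = list(text)                  # split into individual characters
ids = [vocab[ch] for ch in tokens]   # map each character with the vocabulary

print(tokens)
print(ids)
```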
Advantages
Disadvantages
Use cases:
Usually, such tokenizers are a good fit for specific tasks:
Use case | Task | Reason | Example |
---|---|---|---|
1. Text anomalies | Detecting text anomalies in multilingual user input | Subword or word tokenizers struggle with this task. Models built on them cannot see individual characters; they see tokens, which may contain several characters or even a whole word. That's why analyzing text character by character is a much harder task than it may sound. | |
2. Real-time spell correction for multilingual customer support | Build a customer support chatbot for a global e-commerce platform, where users' queries come in different languages, often with typos, slang, and mixed scripts | Character-level analysis | "I ned help with mi ordar". The bot should be able to correct these misspellings in real time to understand and respond. |
3. Analyzing and generating synthetic DNA sequences | Detecting anomalies or generating synthetic DNA. DNA is a string of characters (ATCG), and you need a system that identifies mutations or invalid bases (e.g. "ATX") | Character-level analysis lets you inspect every base individually (see the sketch below) | input -> tokenization -> output: ATCGTACG -> ['A', 'T', 'C', 'G', 'T', 'A', 'C', 'G'] -> Normal |
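A minimal sketch of the DNA example from the table (the valid-base set and labels are illustrative assumptions, not a real pipeline):

```python
VALID_BASES = set("ATCG")

def check_sequence(seq: str) -> str:
    # character-level tokenization: every base is its own token
    tokens = list(seq)
    return "Normal" if all(t in VALID_BASES for t in tokens) else "Anomaly"

print(check_sequence("ATCGTACG"))  # -> Normal
print(check_sequence("ATXGTACG"))  # -> Anomaly (contains the invalid base 'X')
```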
Let's start with an example of word-level tokenization:
"Hello! My name is Kseniya. My phone number is +123456789"
↓
["Hello", "!", "My", "name", "is", "Kseniya", ".", "My", "phone", "number", "is", "+123456789"]
↓
[11351, 145, 1000, 5341, 511, <UNK>, 2135, 1000, 15111, 193, 511, <UNK>]
Since "Kseniya" and "+123456789" are likely to be treated as OOV (out-of-vocabulary) words, they are replaced with the <UNK> token.
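A rough sketch of this behaviour (the vocabulary, IDs, and splitting regex are toy assumptions, not a real tokenizer):

```python
import re

text = "Hello! My name is Kseniya. My phone number is +123456789"

# toy word-level vocabulary; anything missing falls back to <UNK>
vocab = {"Hello": 11351, "!": 145, "My": 1000, "name": 5341, "is": 511,
         ".": 2135, "phone": 15111, "number": 193, "<UNK>": 0}

words = re.findall(r"\+?\w+|[^\w\s]", text)           # split into words and punctuation
ids = [vocab.get(w, vocab["<UNK>"]) for w in words]   # OOV words map to <UNK>

print(words)
print(ids)
```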
Downsides:
Advantages:
Use cases:
Use case | Task | Reason | Example |
---|---|---|---|
1. Sentiment analysis | Determine the emotional tone of the text, such as whether a product review is positive, negative, or neutral | | "I love this product" - the word "love" is a strong positive indicator (see the sketch below). |
2. Named Entity Recognition (NER) in English | Identify and classify named entities in text, such as people ("John Smith"), organisations ("Google"), and so on | | Extracting key information for the business |
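As a quick illustration of why whole-word tokens help here, a toy lexicon-based sentiment check might look like this (the word lists and scoring are purely illustrative assumptions):

```python
# toy word-level sentiment lexicon (illustrative only, not a real model)
POSITIVE = {"love", "great", "excellent"}
NEGATIVE = {"hate", "terrible", "awful"}

def sentiment(text: str) -> str:
    words = text.lower().split()   # naive word-level tokenization
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this product"))   # -> positive
```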
Sub-word tokenization is the most popular way to tokenize an input sequence and, in the majority of cases, the most beneficial one. Let's see how it works:
"In subword tokenization, rare words like ‘unhappiness’ are split into smaller parts such as ‘un’, ‘happi’, and ‘ness’ to improve language model understanding."
↓
[1, 818, 850, 4775, 11241, 1634, 11, 4071, 2456, 588, 564, 246, 403, 71, 42661, 447, 247, 389, 6626, 656, 4833, 3354, 884, 355, 564, 246, 403, 447, 247, 11, 564, 246, 71, 1324, 72, 447, 247, 11, 290, 564, 246, 1108, 447, 247, 284, 2987, 3303, 2746, 4547, 526]
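If you want to reproduce this kind of encoding yourself, here is a sketch using the GPT-2 tokenizer from Hugging Face `transformers` as one possible byte-level BPE tokenizer (the exact subword splits and IDs depend on the tokenizer you load):

```python
from transformers import AutoTokenizer

# GPT-2 uses byte-level BPE; any other subword tokenizer would work similarly
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = ("In subword tokenization, rare words like 'unhappiness' are split into smaller "
        "parts such as 'un', 'happi', and 'ness' to improve language model understanding.")

tokens = tokenizer.tokenize(text)   # subword strings ('Ġ' marks a leading space)
ids = tokenizer.encode(text)        # the corresponding integer IDs

print(tokens)
print(ids)
```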
Since it is one of the most popular approaches, there are some particular advantages to using it:
Downsides:
Use cases:
Use case | Reason |
---|---|
Build a translation system that handles multiple European languages (English, German, Spanish) | |
Scientific/medical text processing: a biomedical NLP system analyzing research papers and clinical notes (see the sketch below) | |
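For example, a rare biomedical term that a word-level tokenizer would map to <UNK> is still representable with subwords. A quick check, assuming the general-purpose bert-base-uncased WordPiece vocabulary (a domain-specific vocabulary would split differently):

```python
from transformers import AutoTokenizer

# general-purpose WordPiece vocabulary; the exact split depends on the vocabulary used
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# a rare clinical term is covered by several subword pieces instead of <UNK>
print(tokenizer.tokenize("hyperglycemia"))
```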
That's it for tokenizer types. The next chapter contains more details and insights about a particular algorithm of sub-word tokenization – Byte-Pair Encoding (BPE).