Kseniya Parkhamchuk

Tokenizer types

As with many other things, there is no single, universally best tokenizer that every model can use. Different tokenization techniques exist, and they suit different kinds of tasks.

There are 3 types of tokenizers:

  • character
  • subword
  • word

Let's have a look at the most common examples:

Tokenizer types and their characteristics:

Character: splits text into individual characters.

Subword:
  • Byte-Pair Encoding (BPE): an algorithm that iteratively merges the most frequent pairs of characters or character sequences.
  • Unigram Language Model: an algorithm based on a probabilistic language model. Process:
    1. Define an initial (large) vocabulary of subword candidates.
    2. Use an EM (Expectation-Maximization) algorithm to optimize subword probabilities.
    3. Prune the vocabulary by removing low-probability subwords.
  • WordPiece: splits words into subwords; merges token pairs that maximize the likelihood of the training data.
  • Tiktoken: OpenAI's tokenizer for GPT models (BPE with optimizations). Implemented in Rust. Uses regex-based merging rules.
  • SentencePiece: a tokenization framework that treats text as a raw stream (without relying on spaces and punctuation for pre-splitting) and applies BPE or Unigram to segment it.

Word: splits text into words.

Before diving deeper into each one, I would like to add a little bit of context for better understanding later on. Each tokenizer has its own vocabulary, which contains all the available special tokens, characters, subwords, and words (depending on the tokenizer type) used to encode input sequences.

Let's have a closer look at the vocabulary of a Hermes model.

Hermes model vocabulary


Ok, fine, but where do these vocabularies come from?

In reality, it's not enough just to implement a tokenizer. You should also train it to encode your text in the best possible way. During training, you prepare a bunch of data (usually general datasets like the Pile or OpenWebText, but for more advanced use cases it might be domain-specific, like PubMed data) and feed it to the tokenizer, which segments the text according to the algorithm it uses and saves its findings. That might mean saving all unique characters and words, finding the most frequent pairs (BPE), removing tokens based on the probability of their usage, etc.

Whichever algorithm is used, the expected result is a vocabulary.
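To make this concrete, here is a minimal, purely illustrative sketch (not a real training pipeline) that builds a character-level starting vocabulary from a toy corpus and performs a single BPE-style merge of the most frequent pair:

```python
from collections import Counter

# Toy corpus standing in for a real training dataset (the Pile, OpenWebText, ...)
corpus = ["low", "lower", "lowest", "newer", "newest"]

# Step 1: start the vocabulary from all unique characters in the corpus
vocab = sorted({ch for word in corpus for ch in word})

# Step 2 (a single BPE-style step): count adjacent symbol pairs
# and add the most frequent one to the vocabulary as a new token
pair_counts = Counter()
for word in corpus:
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += 1

best_pair = max(pair_counts, key=pair_counts.get)  # ('w', 'e') for this corpus
vocab.append("".join(best_pair))

print(vocab)  # the saved "findings" become the vocabulary
```

Real tokenizer training repeats this kind of step thousands of times over much more data; the chapter on BPE goes into detail.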

Let's move back to tokenizer types and have a look at what the tokenization process looks like.

Character level tokenization

String sequence -> splitting by characters -> mapping with vocabulary -> tokens

"Hello! How are you?"

["H", "e", "l", "l", "o", "!","_", "H", "o", "w", "_","a", "r", "e", "_", "y", "o", "u", "?"]

["H" -> 15, "e" -> 5, "l" -> 10, "l" -> 10, "o" -> 20, "!" -> 51,"" -> 101, "H" -> 15, "o" -> 20, "w" -> 24, "" -> 101,"a" -> 1, "r" -> 18, "e" -> 5, "_" -> 101, "y" -> 56, "o" -> 20, "u" -> 45, "?" -> 155]

[15, 5, 10, 10, 20, 51, 101, 15, 20, 24, 101, 1, 18, 5, 101, 56, 20, 45, 155]

P.S. The numerical representation does not correspond to any real-world tokenizer; this is just an example.
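Here is a minimal sketch of the process above (the vocabulary and IDs are invented, exactly as in the example):

```python
# Toy character-level tokenizer; the vocabulary below is made up for illustration.
vocab = {"H": 15, "e": 5, "l": 10, "o": 20, "!": 51, "_": 101,
         "w": 24, "a": 1, "r": 18, "y": 56, "u": 45, "?": 155}

def encode(text: str) -> list[int]:
    # Spaces are represented as "_", matching the example above
    return [vocab[ch if ch != " " else "_"] for ch in text]

print(encode("Hello! How are you?"))
# [15, 5, 10, 10, 20, 51, 101, 15, 20, 24, 101, 1, 18, 5, 101, 56, 20, 45, 155]
```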

Advantages

  1. Small vocabulary
  2. Easy to implement
  3. Can handle any word
  4. No vocabulary training required
  5. Gives full control over individual characters

Disadvantages

  1. Very difficult for models to capture semantics from individual characters
  2. Long output sequences result in more computation

Use cases:

Usually, such tokenizers might be good for specific tasks:

  • Noisy text ("heyyyy", OCR errors)
  • Code processing
Use case 1: Text anomalies
Task: detecting text anomalies in multilingual user input.
Reason: subword or word tokenizers will struggle with this task. One of the models' specifics is that they cannot see individual characters; they see tokens, which might contain several characters or even a whole word. That's why analyzing text character by character is a much more difficult task than it may sound.
Examples:
  1. "Heyyy im gr8"
  2. "나는 좋아해요 coding in Python"
  3. "h3llo w0rld"

Use case 2: Real-time spell correction for multilingual customer support
Task: build a customer support chatbot for a global e-commerce platform, where user queries come in different languages, often with typos, slang, and mixed scripts.
Reason: character-level analysis.
Example: "I ned help with mi ordar". The bot should be able to correct these misspellings in real time to understand and respond.

Use case 3: Analyzing and generating synthetic DNA sequences
Task: detecting anomalies or generating synthetic DNA. DNA is a string of characters (A, T, C, G), and you need a system to identify mutations.
Reason: character-level analysis allows you to (see the sketch right after this table):
  1. Spot individual point mutations (e.g., ATCG → ATXG, where X is an anomaly)
  2. Generate biologically plausible new sequences one nucleotide at a time.

input -> tokenization -> output
ATCGTACG -> ['A', 'T', 'C', 'G', 'T', 'A', 'C', 'G'] -> Normal
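A tiny, purely illustrative sketch of the DNA use case: with character-level tokens, flagging anything that is not a valid nucleotide becomes trivial.

```python
VALID_NUCLEOTIDES = {"A", "T", "C", "G"}

def check_sequence(dna: str) -> str:
    # Character-level tokenization: every nucleotide is its own token
    tokens = list(dna)
    anomalies = [(i, ch) for i, ch in enumerate(tokens) if ch not in VALID_NUCLEOTIDES]
    return "Normal" if not anomalies else f"Anomaly at {anomalies}"

print(check_sequence("ATCGTACG"))  # Normal
print(check_sequence("ATXGTACG"))  # Anomaly at [(2, 'X')]
```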

Word level tokenization

Let's start with an example:

"Hello! My name is Kseniya. My phone number is +123456789"

["Hello", "!", "My", "name", "is", "Kseniya", ".", "My", "phone", "number", "is", "+123456789"]

[11351, 145, 1000, 5341, 511, <UNK>, 2135, 1000, 15111, 193, 511, <UNK>]

Since "Kseniya" and "+123456789" are likely to be treated as OOV (out-of-vocabulary) words, they are replaced with the <UNK> token.
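A minimal sketch of this behaviour (the vocabulary and IDs are invented, as above):

```python
import re

# Toy word-level vocabulary; anything not listed falls back to <UNK> (id 0 here)
vocab = {"Hello": 11351, "!": 145, "My": 1000, "name": 5341, "is": 511,
         ".": 2135, "phone": 15111, "number": 193, "<UNK>": 0}

def encode(text: str) -> list[int]:
    # Naive word/punctuation split; real word tokenizers use more elaborate rules.
    # Note that this regex splits "+123456789" into "+" and "123456789",
    # both of which are OOV and therefore map to <UNK> anyway.
    words = re.findall(r"\w+|[^\w\s]+", text)
    return [vocab.get(word, vocab["<UNK>"]) for word in words]

print(encode("Hello! My name is Kseniya. My phone number is +123456789"))
# [11351, 145, 1000, 5341, 511, 0, 2135, 1000, 15111, 193, 511, 0, 0]
```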

Downsides:

• Big vocabulary size (it has to store many words).
• Poor handling of OOV words.
• The vocabulary needs more storage because of its size.
• Cannot be used for languages without spaces between words.
• Processing punctuation is hard (e.g. "U.S.A.", "don't").

Advantages:

• Simplicity and intuitiveness
• Results in shorter token sequences

Use cases:

Use case 1: Sentiment analysis
Task: determine the emotional tone of the text, such as whether a product review is positive, negative, or neutral.
Reason:
  • Sentiment often hinges on specific words like "great", "terrible", "love", "hate", which carry emotional weight. A word-level tokenizer keeps them as single tokens.
  • Computationally efficient.
Example: "I love this product" - the word "love" is a strong positive indicator.

Use case 2: Named Entity Recognition (NER) in English
Task: identify and classify named entities in text, such as people ("John Smith"), organisations ("Google"), and so on.
Reason:
  1. Preserves the word as a meaningful unit.
  2. Straightforward mapping to entity labels ("Apple" - organisation, "New York" - location).
Example: extracting key information for the business:
  • identify companies, people, locations, news, reports
  • stay informed about market trends or track competitors

Sub-word tokenization

Sub-word tokenization is the most popular and, in the majority of cases, the most beneficial way to tokenize an input sequence. Let's see how it works:

"In subword tokenization, rare words like ‘unhappiness’ are split into smaller parts such as ‘un’, ‘happi’, and ‘ness’ to improve language model understanding."

Mapped to token IDs, the sentence becomes something like:

[1, 818, 850, 4775, 11241, 1634, 11, 4071, 2456, 588, 564, 246, 403, 71, 42661, 447, 247, 389, 6626, 656, 4833, 3354, 884, 355, 564, 246, 403, 447, 247, 11, 564, 246, 71, 1324, 72, 447, 247, 11, 290, 564, 246, 1108, 447, 247, 284, 2987, 3303, 2746, 4547, 526]
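The IDs above look like output from a GPT-2-style BPE tokenizer. As a rough sketch, you can produce this kind of subword encoding with the tiktoken library; the exact splits and IDs depend on the encoding you choose:

```python
import tiktoken  # pip install tiktoken

# GPT-2 BPE encoding; other encodings (e.g. "cl100k_base") produce different splits and IDs
enc = tiktoken.get_encoding("gpt2")

text = ("In subword tokenization, rare words like 'unhappiness' are split into "
        "smaller parts such as 'un', 'happi', and 'ness' to improve language "
        "model understanding.")

ids = enc.encode(text)
tokens = [enc.decode([i]) for i in ids]

print(ids)     # token IDs
print(tokens)  # rare words such as 'unhappiness' end up as several subword pieces
```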

As it is one of the most popular approaches, it comes with some particular advantages:

1. Allows OOV words to be handled easily
2. Doesn't need a very large vocabulary in comparison to word-level tokenization
3. The tokenized sequence is of a reasonable size

Downsides:

1. More complex implementation
2. Tokens are often morphologically meaningless

Use cases:

Use case 1: Build a translation system that handles multiple European languages (English, German, Spanish).
Reason:
  1. The languages share many morphological patterns, which enables a single shared vocabulary.
  2. For languages like German that include compound words, such as "Orangensaft" (orange juice), the word can be broken into meaningful parts.

Use case 2: Scientific/medical text processing: a biomedical NLP system analyzing research papers and clinical notes.
Reason:
  1. Many specialized terms share suffixes and prefixes ("cardio-", "-ology", "-itis").
  2. A lot of rare or new words and names can be handled.
  3. Easier to build patterns: e.g., "myocarditis" is likely related to "cardiology" because of shared subword units.

That's it for tokenizer types. The next chapter contains more details and insights about a particular algorithm of sub-word tokenization – Byte-Pair Encoding (BPE).