
Tokenization

1994 · Active · Published: 17 May 2026 · Updated: 17 May 2026

How it works

1. Tokenizer training (offline, one-time): the BPE/WordPiece/Unigram algorithm analyzes a reference corpus and builds a subword vocabulary. BPE iteratively merges the most frequent pairs of adjacent symbols; WordPiece merges pairs that maximize corpus likelihood under a unigram model; Unigram starts with a large vocabulary and iteratively removes tokens with the smallest impact on likelihood.
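The BPE merge loop described above can be sketched in a few lines. This is a toy illustration, not the code of any listed library: the function name `learn_bpe`, the `{word: count}` input format, and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} mapping."""
    # Each word starts as a tuple of single-character symbols.
    corpus = {tuple(word): count for word, count in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the lexicographically larger pair
        # (an arbitrary choice that keeps the result deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges


merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
```

The returned merge list, in order, is the trained artifact: the vocabulary is the initial symbols plus one new symbol per merge.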

2. Encoding (tokenizer inference): input text is split into a sequence of vocabulary tokens, typically via greedy longest-match (WordPiece), priority-queue merges (BPE), or Viterbi (Unigram). The output is a list of integer IDs.
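As an illustration of the encoding step, the sketch below applies greedy longest-match segmentation in the WordPiece style to a single word and then maps the pieces to integer IDs. The vocabulary, the `[UNK]` name, and both function names are hypothetical; production tokenizers add normalization and pre-tokenization before this step.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of one word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


def to_ids(tokens, vocab):
    """Map token strings to their integer IDs."""
    return [vocab[token] for token in tokens]


# Hypothetical toy vocabulary (token -> ID).
vocab = {"[UNK]": 0, "token": 1, "##ization": 2, "##s": 3}
ids = to_ids(wordpiece_encode("tokenization", vocab), vocab)
```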

3. Decoding: the reverse process, in which IDs are mapped back to character sequences. For detokenization to be lossless, the tokenizer needs conventions that record word boundaries and whitespace (e.g. the "##" continuation prefix in WordPiece, the "▁" whitespace marker in SentencePiece, leading-space encoding in GPT tokenizers).
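A minimal sketch of how two of these conventions are undone at decoding time. It works on token strings rather than IDs, and the function names are made up for the example; real decoders also handle special tokens and byte fallback.

```python
def decode_wordpiece(tokens):
    """Glue "##" continuation pieces onto the preceding piece."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    # Words are rejoined with single spaces, so the original
    # whitespace is not recovered exactly.
    return " ".join(words)


def decode_sentencepiece(tokens):
    """Turn the "▁" marker (U+2581) back into spaces."""
    text = "".join(tokens).replace("\u2581", " ")
    # Drop the single space produced by the marker on the first token.
    return text[1:] if text.startswith(" ") else text
```

The comparison shows why the marker matters: the SentencePiece-style convention stores whitespace inside the tokens themselves, while the WordPiece-style join has to guess it.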

4. Token budget: the model has a fixed context window measured in tokens. Commercial APIs (OpenAI, Anthropic, Google) bill usage in tokens. Tokens per character range from ~0.25 (simple English sentences) to 1 or more for scripts the vocabulary covers poorly, where byte-level fallback can spend several tokens on a single character; code and structured data usually fall between these extremes.
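For rough planning, a budget check can be sketched from the tokens-per-character ratio alone. The helper names and the default ratio of 0.25 (the English figure quoted above) are assumptions; an exact count requires running the model's actual tokenizer.

```python
import math


def estimate_tokens(text, tokens_per_char=0.25):
    """Rough token estimate from character count; not an exact count."""
    return math.ceil(len(text) * tokens_per_char)


def fits_context(text, context_window, reserved_for_output=0, tokens_per_char=0.25):
    """Check whether the prompt plus reserved output tokens fits the window."""
    needed = estimate_tokens(text, tokens_per_char) + reserved_for_output
    return needed <= context_window
```

Passing a higher `tokens_per_char` for non-English text gives a more conservative estimate.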

Problem solved

Neural models require input from a fixed, finite domain represented as vectors. Natural text, however, has an unbounded space of Unicode character sequences. Tokenization bridges the two: it maps that unbounded space onto a finite vocabulary (typically 32k–200k entries) in a way that minimizes sequence length for a typical training corpus while still covering all possible inputs. The naive approaches, word-level and character-level, fail at opposite extremes: word-level produces an enormous vocabulary, cannot handle out-of-vocabulary (OOV) words, and struggles with morphology; character-level produces very long sequences and is compute-expensive. Subword tokenization is the compromise that became the industry standard.

Components

Vocabulary

Vocabulary-learning algorithm

Encoding algorithm

Special tokens

Pre-tokenizer and normalization

Implementation

Reference implementations

tiktoken (OpenAI)

SentencePiece (Google)

Hugging Face tokenizers (Rust + Python)

subword-nmt (Sennrich's original BPE implementation)

anthropic-tokenizer-count (Claude token estimation)

Implementation pitfalls

Over-segmentation of non-English languages (High)

Tokenizers trained primarily on English corpora produce 2–3× as many tokens for Polish, Japanese, Arabic, or Hindi as for equivalent English text, which raises API cost and shrinks the effective context window for these languages.
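One contributing factor can be measured without any tokenizer. In a byte-level BPE tokenizer every token covers at least one UTF-8 byte, so the byte count is an upper bound on the token count, and scripts with more bytes per character start from a worse baseline before any merges apply. The helper below is a hypothetical illustration of that bound, not a measurement of any real tokenizer.

```python
def utf8_bytes_per_char(text):
    """Average UTF-8 bytes per character: the worst-case tokens-per-character
    ratio for a byte-level tokenizer that has learned no merges for this text."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


samples = {
    "English": "tokenization",
    "Polish": "\u017c\u00f3\u0142\u0107",    # four letters with diacritics
    "Japanese": "\u65e5\u672c\u8a9e",        # three kanji
}
ratios = {name: utf8_bytes_per_char(text) for name, text in samples.items()}
```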

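Pathological number tokens (High)

BPE splits numbers into corpus-frequency-dependent fragments (e.g. "1234" as "123"+"4" or "12"+"34"), making arithmetic learning hard. Some tokenizers mitigate this at the pre-tokenization stage: Llama 1 and 2 split numbers into individual digits, while newer tokenizers such as those of GPT-4 and Llama 3 cap digit runs at three digits.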
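A digit-splitting rule of this kind can be sketched as a pre-tokenization regex that caps the length of digit runs before BPE sees them. The function name and the simplified pattern are assumptions for illustration; production pre-tokenizers use considerably longer patterns.

```python
import re


def split_digits(text, max_run=1):
    """Split text so that no chunk contains more than `max_run` digits.

    Digit runs are cut left to right; non-digit stretches are kept whole.
    """
    pattern = re.compile(r"\d{1,%d}|\D+" % max_run)
    return pattern.findall(text)


single_digits = split_digits("pay 1234 now")           # one digit per chunk
three_digit_runs = split_digits("pay 1234 now", 3)     # at most three digits
```

With chunks fixed in advance, BPE merges can no longer fuse digits across chunk boundaries, so the same number is always segmented the same way. Cutting left to right does mean a four-digit number becomes a three-digit chunk followed by a single digit.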

Glitch tokens (Medium)

Tokens present in the vocabulary but virtually absent from training data (e.g. " SolidGoldMagikarp" in GPT-3) have barely trained embeddings and can trigger anomalous or unintelligible model outputs, which makes them surface area for prompt injection and jailbreaking.

Letter counting in words (Low)

Questions like "how many letters R are in strawberry" often fail because the model sees a few subword tokens (e.g. "straw"+"berry", depending on the tokenizer) rather than ten letters, so it has no direct access to a character-level representation. This is a widely cited example of a tokenization-induced failure.
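The gap between the two views can be made concrete. In the sketch below the two-piece split is the illustrative one from the text, and the integer IDs are invented; the point is only that the letters live in the token strings, which the model never receives.

```python
word = "strawberry"
tokens = ["straw", "berry"]   # illustrative split; real tokenizers may differ
token_ids = [101, 102]        # hypothetical IDs: all that the model receives

# Counting on the raw string is trivial.
count_from_text = word.count("r")

# The same answer is recoverable from the token strings,
# but nothing in the bare integers encodes the spelling.
count_from_tokens = sum(token.count("r") for token in tokens)
```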

Tokenizer mismatch in fine-tuning (High)

Using a tokenizer that differs from the one used during pretraining (e.g. adding new special tokens without resizing embeddings) can corrupt the model silently: output still looks sensible, but quality collapses.
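A cheap guard is to compare the tokenizer's IDs against the size of the embedding table before training starts. The sketch below uses a plain dict and an integer row count as stand-ins; the function name and both inputs are hypothetical, and in practice the values would come from the tokenizer and model objects of whichever framework is in use.

```python
def find_uncovered_tokens(vocab, num_embedding_rows):
    """Return tokens whose IDs have no row in the embedding table."""
    return sorted(
        token for token, token_id in vocab.items()
        if token_id >= num_embedding_rows
    )


# Hypothetical case: one special token was added after pretraining,
# but the embedding table still has the original two rows.
vocab = {"hello": 0, "world": 1, "<tool_call>": 2}
missing = find_uncovered_tokens(vocab, num_embedding_rows=2)
```

A non-empty result means the embedding table must be resized, and the new rows trained, before fine-tuning proceeds.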