1. Tokenizer training (offline, one-time): the BPE/WordPiece/Unigram algorithm analyzes a reference corpus and builds a subword vocabulary. BPE iteratively merges the most frequent pairs of adjacent symbols; WordPiece merges pairs that maximize corpus likelihood under a unigram model; Unigram starts with a large vocabulary and iteratively removes tokens with the smallest impact on likelihood.
2. Encoding (tokenizer inference): input text is split into a sequence of vocabulary tokens, typically via greedy longest-match (WordPiece), priority-queue merges (BPE), or Viterbi (Unigram). The output is a list of integer IDs.
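3. Decoding: the reverse process, in which IDs are mapped back to token strings and concatenated into text. For the round trip to be lossless, the tokenizer needs conventions that record word boundaries and whitespace (e.g. the "##" continuation prefix in WordPiece, the "▁" word-start marker in SentencePiece, the space folded into the start of a token in GPT-style byte-level BPE). Tokenizers that normalize their input, such as lowercasing WordPiece models, are not strictly lossless.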
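4. Token budget: the model has a fixed context window measured in tokens. Commercial APIs (OpenAI, Anthropic, Google) bill usage in tokens. Tokens per character typically range from ~0.25 for plain English prose (about four characters per token) to 1 or more for scripts that are poorly covered by the vocabulary, where a single character can fall back to several byte tokens. Code and structured data usually tokenize less efficiently than prose.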
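The training, encoding, and decoding steps above can be illustrated with a toy character-level BPE. This is a minimal sketch, not a production tokenizer: the function names (`train_bpe`, `apply_merge`, `build_vocab`, `encode`, `decode`) and the sample corpus are made up for illustration, words are tokenized independently with no whitespace handling, and encoding simply replays the merges in learned order instead of using a priority queue.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[apply_merge(symbols, best)] += freq
        words = merged_words
    return merges


def build_vocab(corpus, merges):
    """Vocabulary = base characters plus one token per merge, mapped to integer IDs."""
    chars = sorted({ch for word in corpus for ch in word})
    tokens = list(dict.fromkeys(chars + [a + b for a, b in merges]))
    return {tok: i for i, tok in enumerate(tokens)}


def encode(word, merges, vocab):
    """Split a word into characters, replay the merges in order, return integer IDs."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    # A character never seen in training raises KeyError in this sketch.
    return [vocab[s] for s in symbols]


def decode(ids, id_to_token):
    """Map IDs back to token strings and concatenate them."""
    return "".join(id_to_token[i] for i in ids)


corpus = ["low", "low", "lower", "lowest", "newer", "newer", "wider"]
merges = train_bpe(corpus, num_merges=10)
vocab = build_vocab(corpus, merges)
id_to_token = {i: tok for tok, i in vocab.items()}

ids = encode("lower", merges, vocab)
print(ids, "->", decode(ids, id_to_token))
```

The length of the list returned by `encode` is the quantity that counts against the token budget. Real tokenizers add pre-tokenization, byte fallback for unseen characters, and special tokens on top of this core loop.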
Neural models require input from a fixed, finite domain represented as vectors. Natural text, however, is drawn from an unbounded space of Unicode character sequences. Tokenization bridges the two: it maps that unbounded space onto a finite vocabulary (typically 32k–200k entries) in a way that minimizes sequence length for a typical training corpus while still covering all possible inputs. The naive approaches, word-level and character-level, fail at opposite extremes: word-level produces an enormous vocabulary, cannot handle OOV (out-of-vocabulary) words, and struggles with morphology; character-level produces very long sequences and is compute-expensive. Subword tokenization is the compromise that became the industry standard.
Tokenizers trained primarily on English corpora often produce 2–3× more tokens for the same content in Polish, Japanese, Arabic, or Hindi, which raises API cost and shrinks the effective context window for these languages.
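One contributing factor can be seen without any tokenizer at all. Byte-level tokenizers start from UTF-8 bytes, so a script whose characters take two or three bytes begins at a disadvantage unless the vocabulary contains merges for it. The sketch below only measures that no-merge starting point; the helper name `utf8_bytes_per_char` and the sample words are illustrative, and real token counts depend on the learned merges.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes per character (code point) in `text`."""
    return len(text.encode("utf-8")) / len(text)


# Illustrative sample words, not a benchmark.
samples = {
    "English": "tokenization",
    "Polish": "żółć",
    "Hindi": "नमस्ते",
    "Japanese": "こんにちは",
}

for language, word in samples.items():
    print(f"{language}: {utf8_bytes_per_char(word):.1f} bytes per character")
```

A tokenizer trained mostly on English learns many merges for the one-byte ASCII range and comparatively few for multi-byte sequences, so more of those bytes remain as separate tokens.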
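BPE splits numbers into corpus-frequency-dependent fragments (e.g. "1234" as "123"+"4" or "12"+"34"), which makes arithmetic harder to learn. Some tokenizers mitigate this with a pre-tokenization rule: Llama 1 and 2 split numbers into single digits, while the GPT-4 (cl100k_base) and Llama 3 tokenizers cap digit groups at three digits.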
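Forced digit-splitting is usually applied as a pre-tokenization step, before any merges run. The following is a minimal sketch of the idea using Python's `re` module; `split_digits` and its `max_group` parameter are hypothetical names, and the pattern is a simplification rather than the regex any particular model ships with.

```python
import re


def split_digits(text, max_group=3):
    """Split text into chunks, cutting digit runs into groups of at most `max_group`.

    Digit runs are chunked left to right; non-digit runs are kept whole.
    """
    pattern = re.compile(r"\d{1,%d}|\D+" % max_group)
    return pattern.findall(text)


print(split_digits("1234567"))            # groups of up to three digits
print(split_digits("1234", max_group=1))  # one digit per chunk
print(split_digits("x=1234"))             # non-digits stay together
```

Because the chunks are fixed before BPE runs, merges can never cross a chunk boundary, so the same number is always split the same way regardless of corpus frequency. Note that left-to-right grouping does not align with place value ("1234" becomes "123"+"4", not "1"+"234").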
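Tokens present in the vocabulary but virtually absent from training data (e.g. " SolidGoldMagikarp" in the GPT-2/GPT-3 vocabulary) have embeddings that were barely updated during training and can trigger erratic, unintelligible model outputs, which makes them a potential attack surface for prompt injection and jailbreaking.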
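Questions like "how many letters R are in strawberry" often fail because the model sees a few subword tokens (e.g. "straw"+"berry"; the exact split depends on the tokenizer) rather than individual letters, so it has no direct access to the character-level representation. This is a frequently cited example of a tokenization-induced failure.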
Using a different tokenizer than the one used during pretraining, or changing the vocabulary without updating the model (e.g. adding new special tokens without resizing and training the embeddings), can corrupt the model silently: the output still looks sensible, but quality collapses.
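The invariant behind this pitfall is that every token ID the tokenizer can emit must index a row of the embedding table. The sketch below shows that check and a resize step on a plain list-of-lists table. `check_ids` and `resize_embeddings` are hypothetical helpers, and initializing new rows to the mean of existing rows is one common heuristic, chosen here as an assumption. In Hugging Face Transformers the corresponding step is `model.resize_token_embeddings(len(tokenizer))`.

```python
def check_ids(token_ids, embeddings):
    """Raise IndexError if any token ID has no row in the embedding table."""
    bad = [i for i in token_ids if not 0 <= i < len(embeddings)]
    if bad:
        raise IndexError(f"token IDs without an embedding row: {bad}")


def resize_embeddings(embeddings, new_vocab_size):
    """Return a copy of the table grown to `new_vocab_size` rows.

    New rows are set to the mean of the existing rows (a heuristic, assumed here);
    they still need training before the new tokens carry any meaning.
    """
    if new_vocab_size < len(embeddings):
        raise ValueError("shrinking is not handled in this sketch")
    dim = len(embeddings[0])
    mean_row = [sum(row[d] for row in embeddings) / len(embeddings) for d in range(dim)]
    extra = [list(mean_row) for _ in range(new_vocab_size - len(embeddings))]
    return [list(row) for row in embeddings] + extra


embeddings = [[1.0, 2.0], [3.0, 4.0]]      # toy table: 2 tokens, dimension 2
resized = resize_embeddings(embeddings, 3)  # one new special token with ID 2
check_ids([0, 2], resized)                  # passes: ID 2 now has a row
```

Resizing only removes the indexing error. Until the new rows are trained, the model treats the added tokens as near-meaningless input, which is where the quiet quality loss comes from.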