1. Initialization: the corpus is split into words, each word into a sequence of base symbols (characters or bytes). The initial vocabulary contains those base symbols.
2. Pair counting: for each word in the corpus, all pairs of adjacent symbols are counted, weighted by word frequency.
3. Most-frequent-pair selection: the pair (a, b) with the highest aggregate corpus frequency is chosen as the merge candidate.
4. Vocabulary and corpus update: a new symbol "ab" is added to the vocabulary; all occurrences of the pair (a, b) in the corpus are replaced with this symbol. The pair is recorded on the merge list.
5. Repeat: steps 2โ4 are repeated until the vocabulary reaches the target size (or the most-frequent-pair count falls below a threshold).
6. Encoding new text: text is split into base symbols; the merge list is applied in priority order โ reproducing the same merges the algorithm performed on the corpus. Production implementations (tiktoken, HF tokenizers) use a priority queue for O(n log n) performance.
Word-level tokenization fails on OOV, morphology, and multilinguality; character-level produces very long sequences. BPE solves both at once: it learns a vocabulary that compresses common sequences (shorter than char-level) while preserving coverage of rare words via their subunits (no <UNK> for essentially arbitrary text).
BPE merges should not cross word or whitespace boundaries โ without a proper pre-tokenizer regex (GPT-2 pattern), the tokenizer learns "nonsensical" merges joining the end of one word with the beginning of the next.
A vocabulary trained primarily on English produces 2โ3ร more tokens for Polish, Japanese, Arabic โ raising API cost and shrinking the effective context window.
Without forced digit-splitting, BPE splits numbers into corpus-frequency-dependent fragments ("1234" โ "123"+"4" or "12"+"34"). Llama 3 and GPT-4o mitigate this by forcing single-digit tokens.
Merges generating tokens that are rare in training data (e.g. "SolidGoldMagikarp" in GPT-3) cause deterministic model failures โ an attack surface for prompt injection and jailbreaks.
Implementations with different tie-breaking for equal-frequency pairs may produce different tokenizations of the same text โ diverging training and inference vocabularies. Always use the official model implementation (tiktoken for GPT, tokenizer.json HF).