Other

BPE

1994ActivePublished: 17 May 2026Updated: 17 May 2026Published

How it works

1. Initialization: the corpus is split into words, each word into a sequence of base symbols (characters or bytes). The initial vocabulary contains those base symbols.

2. Pair counting: for each word in the corpus, all pairs of adjacent symbols are counted, weighted by word frequency.

3. Most-frequent-pair selection: the pair (a, b) with the highest aggregate corpus frequency is chosen as the merge candidate.

4. Vocabulary and corpus update: a new symbol "ab" is added to the vocabulary; all occurrences of the pair (a, b) in the corpus are replaced with this symbol. The pair is recorded on the merge list.

5. Repeat: steps 2–4 are repeated until the vocabulary reaches the target size (or the most-frequent-pair count falls below a threshold).

6. Encoding new text: text is split into base symbols; the merge list is applied in priority order — reproducing the same merges the algorithm performed on the corpus. Production implementations (tiktoken, HF tokenizers) use a priority queue for O(n log n) performance.

Problem solved

Word-level tokenization fails on OOV, morphology, and multilinguality; character-level produces very long sequences. BPE solves both at once: it learns a vocabulary that compresses common sequences (shorter than char-level) while preserving coverage of rare words via their subunits (no <UNK> for essentially arbitrary text).

Components

Base alphabet

Merge list

Pre-tokenizer

Special tokens

Implementation

Reference implementations

tiktoken (OpenAI)

subword-nmt (Sennrich, oryginalna implementacja NLP)

SentencePiece (Google) — BPE i Unigram

Hugging Face tokenizers (Rust + Python)

minbpe (Andrej Karpathy, edukacyjna implementacja)

Implementation pitfalls

Suboptimal pre-tokenizationHigh

BPE merges should not cross word or whitespace boundaries — without a proper pre-tokenizer regex (GPT-2 pattern), the tokenizer learns "nonsensical" merges joining the end of one word with the beginning of the next.

Language bias of the corpusHigh

A vocabulary trained primarily on English produces 2–3× more tokens for Polish, Japanese, Arabic — raising API cost and shrinking the effective context window.

Number tokenization pathologyHigh

Without forced digit-splitting, BPE splits numbers into corpus-frequency-dependent fragments ("1234" → "123"+"4" or "12"+"34"). Llama 3 and GPT-4o mitigate this by forcing single-digit tokens.

Glitch tokensMedium

Merges generating tokens that are rare in training data (e.g. "SolidGoldMagikarp" in GPT-3) cause deterministic model failures — an attack surface for prompt injection and jailbreaks.

Non-deterministic encoding implementationsMedium

Implementations with different tie-breaking for equal-frequency pairs may produce different tokenizations of the same text — diverging training and inference vocabularies. Always use the official model implementation (tiktoken for GPT, tokenizer.json HF).