Neural Networks: From Fundamentals to Modern AI · The Attention Mechanism and the Transformer
Tokenization and BPE — why text is neither characters nor words
Introduction
A Transformer does not operate on characters or words. It operates on tokens drawn from a fixed vocabulary, typically 30k–200k entries. The choice of tokenization is not a technical detail but an architectural decision that affects: (1) context length, which is measured in tokens, (2) model quality on rare words and morphologically rich languages, (3) embedding parameter count (vocab × d_model), and (4) API pricing, which is usually charged per token.

The simplest approaches sit at the extremes. Char-level tokenization keeps the vocabulary tiny (≈ 100 for English text) but makes sequences cripplingly long. Word-level tokenization needs a vocabulary in the millions and still hits the out-of-vocabulary (OOV) problem whenever a new word appears. The sweet spot is subword tokenization.

Key algorithms: Byte Pair Encoding (BPE; Sennrich et al. 2016; used by GPT-2/3/4 and LLaMA), WordPiece (Schuster & Nakajima 2012; BERT), and Unigram (Kudo 2018; T5, ALBERT). SentencePiece (Kudo & Richardson 2018) is not a separate algorithm but a language-agnostic implementation that works on raw text without assuming whitespace-separated words.

BPE starts from characters or bytes and iteratively merges the most frequent adjacent pair into a new token until the vocabulary reaches a target vocab_size. GPT-2 introduced byte-level BPE: it treats all 256 possible byte values as base tokens and runs on the UTF-8 encoding of the text, so OOV is impossible, because any byte sequence can be represented.

Practical consequences: as a rule of thumb, 1 token ≈ 4 English characters ≈ 0.75 words. Polish text uses roughly 30% more tokens than its English equivalent, because Polish is less represented in the tokenizer's training data; the exact figure depends on the tokenizer. Source code compresses much more efficiently when the tokenizer was trained on code.
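The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GPT-2 implementation: it works directly on UTF-8 bytes, skips the regex pre-tokenization step that production tokenizers apply first, and breaks ties by first occurrence. The function names (`train_bpe`, `merge`, `encode`, `decode`) are hypothetical and chosen only for this example.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # base tokens: byte values 0..255
    merges = []                       # merges[i] defines token id 256 + i
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))  # counts of adjacent pairs
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging no longer compresses
        ids = merge(ids, pair, 256 + step)
        merges.append(pair)
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the learned merges in training order."""
    ids = list(text.encode("utf-8"))
    for i, pair in enumerate(merges):
        ids = merge(ids, pair, 256 + i)
    return ids


def decode(ids, merges):
    """Map token ids back to bytes, then to text."""
    table = {i: bytes([i]) for i in range(256)}
    for i, (a, b) in enumerate(merges):
        table[256 + i] = table[a] + table[b]
    return b"".join(table[i] for i in ids).decode("utf-8", errors="replace")


corpus = "low lower lowest low low"
merges = train_bpe(corpus, num_merges=10)
ids = encode("lowest żółw", merges)
print(len(ids), decode(ids, merges))
```

Because the base vocabulary covers every byte value, the Polish word in the usage lines round-trips even though none of its non-ASCII bytes appeared in the training corpus; it simply costs more tokens, which is the byte-level version of the "no OOV" property.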