1) WordPiece tokenization splits text into subwords; [CLS] is prepended (used as the aggregate sequence representation) and [SEP] separates segments.
2) Each token's input is the sum of three embeddings: token, segment (A/B), and positional.
3) The sequence passes through a stack of Transformer encoder layers (12 or 24), each with Multi-Head Self-Attention without causal masking (attention is bidirectional) plus a feed-forward sublayer, both with residual connections and layer normalization.
4) Pre-training: ~15% of tokens are selected for masking (of which 80% are replaced with [MASK], 10% with a random token, and 10% kept) and the model predicts the original (MLM); concurrently the model classifies whether sentence B follows sentence A (NSP).
5) Fine-tuning: a small task head is added (e.g. a linear layer over the [CLS] representation for classification, or span prediction over all token outputs for QA), and the whole model is trained end-to-end with a small learning rate.
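The input layout from the first two steps can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the helper name build_bert_input is hypothetical, and the tokens are assumed to be already split into WordPiece units.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Assemble [CLS] A [SEP] (B [SEP]) plus segment and position ids."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)                 # segment A, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)    # segment B, including its [SEP]
    position_ids = list(range(len(tokens)))         # indices into the positional table
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_bert_input(["the", "cat"], ["it", "sat"])
print(tokens)
print(segment_ids)
```

Each of the three id lists then indexes its own embedding table, and the looked-up vectors are summed per position.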
Earlier language models (LSTM-LM, ELMo, GPT-1) were either unidirectional or merely concatenated two independent unidirectional models, which limited contextual word representation quality. There was also no universal pre-trained model that, after fine-tuning, would achieve top performance across many diverse NLP tasks without bespoke architectures.
Stack of 12 (Base) or 24 (Large) identical Transformer encoder layers, each with Multi-Head Self-Attention and a feed-forward sublayer.
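To illustrate what unmasked (bidirectional) self-attention means inside each layer, here is a deliberately simplified sketch. It assumes a single head with Q = K = V = x and no learned projections, which is not how BERT is parameterized; the causal flag exists only to contrast with a GPT-style mask.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, causal=False):
    """Single-head scaled dot-product attention over a list of vectors."""
    d = len(x[0])
    outputs, weights = [], []
    for i, query in enumerate(x):
        scores = []
        for j, key in enumerate(x):
            if causal and j > i:
                scores.append(float("-inf"))        # hide future positions
            else:
                scores.append(sum(a * b for a, b in zip(query, key)) / math.sqrt(d))
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return outputs, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, bidirectional = self_attention(x)               # BERT-style: every token sees every token
_, left_to_right = self_attention(x, causal=True)  # decoder-style, for comparison
```

In the bidirectional case every attention weight is non-zero, so the first token's representation already depends on the last token; with the causal mask it depends only on itself.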
Subword tokenizer with a vocabulary of ~30,000 pieces.
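The subword split can be sketched as greedy longest-match-first lookup against the vocabulary, with "##" marking word-internal pieces. The function name and the five-entry toy vocabulary below are illustrative assumptions; a real vocabulary has roughly 30,000 entries, and text normalization is omitted.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1                      # shrink the candidate from the right
        if match is None:
            return [unk]                  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
```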
Three learned embedding tables summed as input to the encoder.
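A toy version of the summed input embedding is shown below. The table sizes, the random initialization, and the helper names are assumptions made for illustration; the layer normalization and dropout that BERT applies after the sum are left out.

```python
import random


def make_table(rows, dim, rng):
    """Stand-in for a learned embedding table (random values here)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids, segment_ids, token_table, segment_table, position_table):
    """Per position: token embedding + segment embedding + positional embedding."""
    return [
        [t + s + p for t, s, p in zip(token_table[tid], segment_table[sid], position_table[pos])]
        for pos, (tid, sid) in enumerate(zip(token_ids, segment_ids))
    ]


rng = random.Random(0)
token_table = make_table(10, 4, rng)     # toy vocabulary of 10 ids, hidden size 4
segment_table = make_table(2, 4, rng)    # segments A and B
position_table = make_table(8, 4, rng)   # toy maximum length of 8
vectors = embed([1, 5, 2], [0, 0, 1], token_table, segment_table, position_table)
```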
Special token prepended to the sequence whose final hidden state aggregates the whole sequence for classification tasks.
Output layer with weights tied to the embedding table, used in pre-training to predict masked tokens.
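Weight tying means the same matrix serves as both the input lookup table and the output projection. The sketch below assumes a bare dot product plus a per-token bias and skips the small transform (dense layer, activation, layer normalization) that sits in front of the decoder in released BERT code; all names and values are illustrative.

```python
def mlm_logits(hidden, embedding_table, bias):
    """Score every vocabulary entry by the dot product of its embedding with the hidden state."""
    return [
        sum(h * e for h, e in zip(hidden, row)) + b
        for row, b in zip(embedding_table, bias)
    ]


embedding_table = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # toy vocabulary of 3, hidden size 2
bias = [0.0, 0.0, 0.0]                                     # separate, untied output bias
hidden = [0.0, 2.0]                                        # final hidden state at a masked position
logits = mlm_logits(hidden, embedding_table, bias)
predicted_id = max(range(len(logits)), key=logits.__getitem__)
```

Because the projection reuses the embedding rows, the output layer adds only the bias vector to the parameter count.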
Positional embeddings are learned and capped at 512 positions; longer documents require chunking, sliding-window approaches, or long-context variants (Longformer, BigBird).
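One way to handle longer documents is a sliding window over the token ids. The sketch below is an assumption-laden illustration: the function name, the default overlap of 128 tokens, and the reservation of two slots for [CLS] and [SEP] are choices made here, not values from the source.

```python
def sliding_window_chunks(token_ids, max_len=512, overlap=128, num_special=2):
    """Split a long id sequence into overlapping windows that fit the position limit."""
    body = max_len - num_special              # room left after [CLS] and [SEP]
    if not 0 <= overlap < body:
        raise ValueError("overlap must be non-negative and smaller than the window body")
    step = body - overlap
    chunks, start = [], 0
    while True:
        chunks.append(token_ids[start:start + body])
        if start + body >= len(token_ids):    # this window reached the end of the document
            break
        start += step
    return chunks


ids = list(range(1000))                       # stand-in for 1,000 WordPiece ids
chunks = sliding_window_chunks(ids)
print([len(c) for c in chunks])
```

Each chunk is then wrapped in [CLS] and [SEP] and encoded independently; how the per-chunk outputs are merged (for example by max-pooling or by taking the best-scoring span) depends on the task.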
BERT is a bidirectional encoder — unsuited for autoregressive text generation; use decoder-only (GPT-like) or seq2seq (T5, BART) models for generation instead.
Later work (RoBERTa, ALBERT) showed NSP has little or negative effect; prefer NSP-free variants or improved objectives (e.g. Sentence-Order Prediction in ALBERT).
The [MASK] token appears only during pre-training, not fine-tuning, which creates a mismatch between the two stages; the authors mitigate this with the 80/10/10 scheme. Keep this mismatch in mind when modifying the pre-training recipe.
Devlin et al. release BERT on arXiv (October 2018) with Base/Large models and TensorFlow code.
Facebook AI's RoBERTa shows BERT was undertrained; removing NSP, training with larger batches, and using more data yield substantially better results.
Hugging Face introduces DistilBERT — 40% smaller, 60% faster, preserving 97% of BERT-Base quality.
Google introduces ALBERT with cross-layer parameter sharing and replaces NSP with Sentence-Order Prediction.
ELECTRA's 'replaced token detection' objective proves more sample-efficient than MLM.
Google announces BERT in production for query understanding — one of the largest single quality changes in Search history.
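Time complexity: O(L · (n² · d + n · d²)); the n² · d self-attention term dominates once the sequence length exceeds the hidden size. Space complexity: O(L · n² + n · d + L · d²). Here L is the number of layers, n the sequence length, and d the hidden size.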
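To make the quadratic term concrete, the sketch below compares n² · d (pairwise token interactions in self-attention) with n · d² (the per-token dense projections and feed-forward sublayer). Constant factors are dropped and the function name is hypothetical, so the numbers indicate scaling only, not real FLOP counts.

```python
def per_layer_terms(n, d):
    """Leading cost terms of one encoder layer, constants omitted."""
    attention = n * n * d     # every token attends to every token
    dense = n * d * d         # per-token matrix multiplications
    return attention, dense


d = 768                       # BERT-Base hidden size
for n in (128, 512, 2048):
    attention, dense = per_layer_terms(n, d)
    print(n, attention, dense, round(attention / dense, 3))
```

The ratio of the two terms is simply n / d, so at BERT's 512-token limit the dense term is still the larger one for Base, while the attention term takes over for sequences longer than the hidden size.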
Depth of the encoder stack. BERT-Base: 12, BERT-Large: 24.
Internal representation dimensionality. Base: 768, Large: 1024.
Number of attention heads. Base: 12, Large: 16.
BERT accepts at most 512 tokens per sequence (the limit of its learned positional embeddings); shorter inputs are allowed.
Standard vocabulary is 30,522 (uncased) or 28,996 (cased) tokens.
15% of tokens selected for MLM; of those 80% replaced with [MASK], 10% with random tokens, 10% kept unchanged.
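The 80/10/10 scheme can be sketched as follows. This version selects each token independently with probability 0.15, which is a simplification assumed here (the original code samples a fixed number of positions per sequence); the function name and toy vocabulary are likewise illustrative.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15, special=("[CLS]", "[SEP]")):
    """Return (corrupted inputs, labels); labels are None where no prediction is made."""
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in special or rng.random() >= select_prob:
            continue                          # not selected for prediction
        labels[i] = tok                       # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"              # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)     # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels


rng = random.Random(0)
toy_vocab = ["cat", "dog", "sat", "mat", "the"]
tokens = ["[CLS]"] + [rng.choice(toy_vocab) for _ in range(10000)] + ["[SEP]"]
inputs, labels = mask_tokens(tokens, toy_vocab, rng)
```

The loss is computed only at positions where the label is set, including the positions left unchanged, which is what forces the model to keep a useful representation for every input token.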
All parameters are active for every input (dense forward pass), with no routing or sparsity.
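Since every parameter takes part in each forward pass, the parameter count is a direct proxy for compute per token. The tally below is a back-of-the-envelope sketch under stated assumptions: a feed-forward width of 4× the hidden size, biases on every dense layer, two layer norms per encoder layer, a pooler over [CLS], and no pre-training heads. The function name is hypothetical.

```python
def count_bert_params(vocab_size, hidden, layers, max_positions=512, segments=2):
    """Approximate parameter count of a BERT encoder (embeddings + layers + pooler)."""
    ffn = 4 * hidden                                    # assumed feed-forward width
    layer_norm = 2 * hidden                             # scale and shift vectors
    embeddings = (vocab_size + max_positions + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)          # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


base = count_bert_params(vocab_size=30522, hidden=768, layers=12)
large = count_bert_params(vocab_size=30522, hidden=1024, layers=24)
print(f"Base: {base / 1e6:.1f}M, Large: {large / 1e6:.1f}M")
```

Under these assumptions the totals come out at roughly 110M for Base and 335M for Large; the number of attention heads does not appear because splitting the hidden size across heads leaves the projection shapes unchanged.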
Self-attention and FFN are fully parallel across tokens within a layer; no recurrence (unlike RNN/LSTM).
Operations are dense matrix multiplications (GEMMs) ideal for tensor cores; FP16/BF16 mixed precision delivers significant speedup.
The original BERT was pre-trained on Google TPUs; the architecture maps well to systolic MMUs.