Architecture

BERT

2018 · Active · Published

Key innovation

Introduced deep bidirectional language representations by pre-training a Transformer encoder with Masked Language Modeling, enabling a single model to be fine-tuned across many NLP tasks with minimal architectural changes.

How it works

1) WordPiece tokenization splits text into subwords; [CLS] is prepended (its final hidden state serves as the aggregate sequence representation) and [SEP] separates segments.
2) Each token's input is the sum of three embeddings: token, segment (A/B), and positional.
3) The sequence passes through a stack of Transformer encoder layers (12 or 24). Each layer has a Multi-Head Self-Attention sublayer without causal masking (attention is bidirectional) and a feed-forward sublayer, both with residual connections and layer normalization.
4) Pre-training: ~15% of tokens are selected for prediction (of these, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged) and the model predicts the original token (MLM); jointly, the model classifies whether sentence B actually follows sentence A (NSP).
5) Fine-tuning: a small task head is added (e.g. a linear layer over the [CLS] representation for classification, or span prediction over all token outputs for QA), and the whole model is trained end-to-end with a small learning rate.
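The masking rule in step 4 can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the function name `mask_tokens`, the per-token coin flip (rather than selecting a fixed number of positions), and the toy vocabulary are all assumptions made for the example.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}  # never selected for prediction


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted, labels); labels[i] is None where nothing is predicted.

    Hypothetical helper: each non-special token is selected independently
    with probability select_prob, then corrupted with the 80/10/10 rule.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels
```

The loss is computed only at positions where `labels[i]` is not `None`, including the positions whose input token was left unchanged.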

Problem solved

Earlier language models (LSTM-LM, ELMo, GPT-1) were either unidirectional or merely concatenated two independent unidirectional models, which limited contextual word representation quality. There was also no universal pre-trained model that, after fine-tuning, would achieve top performance across many diverse NLP tasks without bespoke architectures.

Components

Transformer Encoder Stack: Produces contextual token representations conditioned on full bidirectional context.

Stack of 12 (Base) or 24 (Large) identical Transformer encoder layers, each with Multi-Head Self-Attention and a feed-forward sublayer.
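To make "bidirectional" concrete, the sketch below computes single-head scaled dot-product attention weights with and without a causal mask. It is a toy in pure Python, assuming one head and no learned projections; `attention_weights` is a name chosen for this example only.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, causal=False):
    """Row i holds the attention distribution of position i over all positions.

    With causal=False (the encoder setting) every position can attend to
    every other position; with causal=True, positions to the right are masked.
    """
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # masked: future position
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        weights.append(softmax(scores))
    return weights
```

In the unmasked case the first token receives nonzero weight on tokens to its right, which is exactly what a left-to-right language model is not allowed to do.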

WordPiece Tokenizer: Splits text into subwords, which mitigates the out-of-vocabulary (OOV) problem.

Subword tokenizer with a vocabulary of ~30,000 pieces.
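As an illustration of how a word is split at inference time, here is a sketch of greedy longest-match-first segmentation with the `##` continuation prefix used in the released BERT vocabularies. The toy vocabulary and the function name `wordpiece` are assumptions for the example; a real vocabulary has roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # marks a piece that continues a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, for illustration only.
toy_vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece("playing", toy_vocab))    # ['play', '##ing']
```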

Token / Segment / Position Embeddings: Represent token identity, segment (A/B) membership, and position in the sequence.

Three learned embedding tables summed as input to the encoder.
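A small sketch of how a sentence pair might be packed and embedded. The helper names (`pack_pair`, `make_table`, `embed`) and the random initialization are assumptions for illustration; the sketch also leaves out the layer normalization and dropout that the reference implementation applies after the sum.

```python
import random


def pack_pair(tokens_a, tokens_b):
    """Build the [CLS] A [SEP] B [SEP] layout with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def make_table(rows, dim, rng):
    """Stand-in for a learned embedding table (random values here)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids, segment_ids, position_ids, tok_tab, seg_tab, pos_tab):
    """Input vector for each position = token + segment + position embedding."""
    return [
        [a + b + c for a, b, c in zip(tok_tab[t], seg_tab[s], pos_tab[p])]
        for t, s, p in zip(token_ids, segment_ids, position_ids)
    ]
```

Because the position table has a fixed number of rows, any `position_ids` value beyond that size has no embedding to look up, which is where the sequence-length limit comes from.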

[CLS] Token Head: Allows fine-tuning to perform classification with a single linear layer.

Special token prepended to the sequence whose final hidden state aggregates the whole sequence for classification tasks.
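A classification head of this kind can be sketched as one linear layer followed by a softmax over the first position's hidden state. This is a simplified illustration with hypothetical names and shapes; it skips any pooling layer an actual implementation may place between the encoder output and the classifier.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def classify_from_cls(hidden_states, weight, bias):
    """Class probabilities computed from the [CLS] position only.

    hidden_states: one vector per token, with [CLS] at index 0.
    weight: num_labels rows of hidden_size values; bias: num_labels values.
    """
    cls_vector = hidden_states[0]
    logits = [
        sum(w * h for w, h in zip(row, cls_vector)) + b
        for row, b in zip(weight, bias)
    ]
    return softmax(logits)
```

Only index 0 is read by the head; the other tokens influence the prediction indirectly, through self-attention in the encoder layers.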

MLM Head: Implements the MLM objective during pre-training.

Output layer with weights tied to the embedding table, used in pre-training to predict masked tokens.
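Weight tying means the output projection reuses the input embedding table: the score for vocabulary entry v is the dot product of the hidden state with v's embedding, plus a per-token bias. The sketch below shows only that tied projection; the name `mlm_logits` is hypothetical, and any transform applied to the hidden state before the projection is omitted.

```python
def mlm_logits(hidden, embedding_table, output_bias):
    """Scores over the vocabulary for one masked position.

    hidden: final hidden state at the masked position (hidden_size values).
    embedding_table: vocab_size rows, the same table used for input tokens.
    output_bias: vocab_size values, the only parameters specific to this layer.
    """
    return [
        sum(h * e for h, e in zip(hidden, row)) + b
        for row, b in zip(embedding_table, output_bias)
    ]


# Toy example: 3-entry vocabulary with 2-dimensional embeddings.
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = mlm_logits([2.0, 3.0], table, [0.0, 0.0, 0.0])
predicted_id = max(range(len(scores)), key=scores.__getitem__)
```

Tying avoids a second vocab_size by hidden_size matrix, which is a sizeable share of the parameters when the vocabulary has about 30,000 entries.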

Implementation

Reference implementations

google-research/bert

Python (TensorFlow)

Official

Hugging Face Transformers — BERT

Python (PyTorch / TensorFlow / JAX)

bert-base-uncased (model card)

Python

Official

Implementation pitfalls

512-token limit (severity: High)

Positional embeddings are learned and capped at 512 positions; longer documents require chunking, sliding-window approaches, or long-context variants (Longformer, BigBird).

Fix: Use a sliding window with overlap, hierarchical aggregation, or a long-context model.
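A sliding window over an already tokenized document might look like the sketch below. The function name, the default overlap of 64 tokens, and the choice to wrap every window in [CLS] and [SEP] are assumptions for illustration; how per-window predictions are merged afterwards depends on the task.

```python
def sliding_windows(tokens, max_len=512, overlap=64):
    """Split a long token list into overlapping windows of at most max_len."""
    body = max_len - 2  # reserve two positions for [CLS] and [SEP]
    if not 0 <= overlap < body:
        raise ValueError("overlap must be in [0, max_len - 2)")
    step = body - overlap
    windows = []
    start = 0
    while True:
        chunk = tokens[start:start + body]
        windows.append(["[CLS]"] + chunk + ["[SEP]"])
        if start + body >= len(tokens):
            break  # this window reached the end of the document
        start += step
    return windows
```

With the defaults, a 1,000-token document yields three windows, and consecutive windows share 64 tokens so that content near a boundary is seen with context on both sides in at least one window.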

Mismatch between pre-training and generation (severity: Medium)

BERT is a bidirectional encoder and is therefore unsuited to autoregressive text generation; use decoder-only (GPT-like) or seq2seq (T5, BART) models for generation instead.

Weakness of Next Sentence Prediction (severity: Low)

Later work (RoBERTa, ALBERT) showed that NSP contributes little to downstream performance and can even hurt it; prefer NSP-free variants or improved objectives (e.g. Sentence-Order Prediction in ALBERT).

[MASK] mismatch between pre-training and fine-tuning (severity: Low)

The [MASK] token appears only during pre-training, never during fine-tuning; the authors mitigate this mismatch with the 80/10/10 replacement scheme. Keep this in mind when modifying the pre-training recipe.