The input is tokenized and embedded in d_model space, then positional encoding is added. The sequence passes through a stack of identical blocks: each block contains multi-head self-attention (Query/Key/Value computed from embeddings; attention(Q,K,V)=softmax(QK^T/√d_k)V) and a position-wise feed-forward network (typically 4× the width of d_model). Residual connections and LayerNorm are applied around each sublayer. In the encoder-decoder variant, the decoder has an additional cross-attention layer over the encoder output and causal masking in self-attention.
Modeling long-range dependencies in sequences with fully parallelizable training computations — something RNNs/LSTMs could not provide due to the sequential nature of their temporal propagation.
Mechanism allowing every position in the sequence to attend to all other positions. The input is projected into three matrices: Query (Q), Key (K), Value (V). Attention weights are computed as softmax(QK^T/sqrt(d_k)) multiplied by V. Multiple 'heads' perform this operation in parallel in low-dimensional subspaces, and the results are concatenated and projected through W_O.
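The formula above can be sketched for a single head in dependency-free Python. The function names (`softmax`, `attention`) and the toy input are illustrative assumptions; a real implementation would use batched matrix multiplies and add the per-head Q/K/V projections and the W_O output projection.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q: T_q x d_k, K: T_k x d_k, V: T_k x d_v (lists of lists).
    Returns (output, weights): output is T_q x d_v, weights is T_q x T_k.
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    weights, output = [], []
    for q in Q:
        # One row of QK^T / sqrt(d_k), then softmax over the key axis.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        output.append([sum(w_j * v[c] for w_j, v in zip(w, V))
                       for c in range(len(V[0]))])
    return output, weights

# Toy self-attention: 3 positions, d_k = d_v = 2 (values are illustrative only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(X, X, X)
```

Each row of `w` sums to 1, so every output vector is a convex combination of the value vectors.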
Two-layer MLP (Linear → activation → Linear) applied independently to each position in the sequence. Originally: ReLU, hidden dimension d_ff = 4·d_model. In modern models often GLU/SwiGLU/GeGLU (LLaMA, PaLM).
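A minimal sketch of both layouts, assuming plain Python lists and hypothetical helper names (`linear`, `ffn_relu`, `ffn_swiglu`). The gated variant follows the common SiLU-gated form without biases; the weights below are toy values chosen only so the result is easy to check by hand.

```python
import math

def linear(x, W, b=None):
    """y = x @ W (+ b) for one vector x; W has shape in_dim x out_dim."""
    out = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]
    if b is not None:
        out = [o + bj for o, bj in zip(out, b)]
    return out

def relu(v):
    return [max(0.0, a) for a in v]

def silu(v):
    return [a / (1.0 + math.exp(-a)) for a in v]

def ffn_relu(x, W1, b1, W2, b2):
    """Original layout: Linear -> ReLU -> Linear."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

def ffn_swiglu(x, W_gate, W_up, W_down):
    """Gated layout: (SiLU(x W_gate) * (x W_up)) W_down, no biases."""
    gate = silu(linear(x, W_gate))
    up = linear(x, W_up)
    return linear([g * u for g, u in zip(gate, up)], W_down)

def position_wise(seq, f):
    """Apply the same network to every position independently."""
    return [f(x) for x in seq]

# Toy sizes: d_model = 2, d_ff = 4 * d_model = 8.
W1 = [[1.0] * 8 for _ in range(2)]
b1 = [0.0] * 8
W2 = [[0.5] * 2 for _ in range(8)]
b2 = [0.0] * 2

seq = [[1.0, 1.0], [1.0, -3.0]]
y_relu = position_wise(seq, lambda x: ffn_relu(x, W1, b1, W2, b2))
y_gated = position_wise(seq, lambda x: ffn_swiglu(x, W1, W1, W2))
```

Note that the gated form uses three weight matrices instead of two, which is why models that adopt it usually shrink the hidden width to keep the parameter count comparable.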
Per-token normalization of activations within a layer, crucial for training stability in deep Transformers. Originally Post-LN (after sub-block); modern models predominantly use Pre-LN (before sub-block) or RMSNorm (a simplified version without centering).
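The difference between the two normalizations can be sketched for a single token vector. The function names and the `eps` default are illustrative assumptions, not taken from any particular library.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Center and scale one token vector, then apply learned gain and bias."""
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    inv = 1.0 / math.sqrt(var + eps)
    return [g * (a - mean) * inv + b for a, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """Scale by the root mean square only: no centering, no bias."""
    ms = sum(a * a for a in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    return [g * a * inv for a, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
ln = layer_norm(x, ones, zeros)   # zero mean, unit variance (up to eps)
rn = rms_norm(x, ones)            # unit mean square, mean left as-is
```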
Adding the sub-block input to its output (y = x + f(x)). Enables training of models with hundreds of layers without vanishing gradients and provides a 'residual stream', an interpretable inter-layer communication channel (Elhage et al. 2021).
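How the residual add combines with normalization can be sketched as two wiring functions. This is an illustration under stated assumptions: `rms_norm` (without learned gain) stands in for whichever normalization a model uses, and `post_ln_step` / `pre_ln_step` are hypothetical names.

```python
import math

def rms_norm(x, eps=1e-6):
    inv = 1.0 / math.sqrt(sum(a * a for a in x) / len(x) + eps)
    return [a * inv for a in x]

def add(x, y):
    return [a + b for a, b in zip(x, y)]

def post_ln_step(x, sublayer, norm):
    """Post-LN: y = norm(x + f(x)); the residual path passes through the norm."""
    return norm(add(x, sublayer(x)))

def pre_ln_step(x, sublayer, norm):
    """Pre-LN: y = x + f(norm(x)); the residual path is left untouched."""
    return add(x, sublayer(norm(x)))

# A sublayer that outputs zeros makes the difference visible.
zero = lambda v: [0.0 for _ in v]
x = [3.0, 4.0]
y_pre = pre_ln_step(x, zero, rms_norm)    # identical to x
y_post = post_ln_step(x, zero, rms_norm)  # rescaled copy of x
```

In the Pre-LN wiring the input reaches the output unchanged whenever the sublayer contributes nothing, which is the identity path that the residual-stream view relies on.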
Mechanism for injecting token position information into a model that lacks recurrence. Originally: additive sinusoidal PE. Today, relative positions dominate (RoPE — Rotary Position Embedding, ALiBi), which generalize to longer contexts.
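The original additive sinusoidal encoding can be sketched as a lookup table, where even columns hold sines and odd columns hold cosines of the same angle. The function name is illustrative and `d_model` is assumed to be even.

```python
import math

def sinusoidal_pe(T, d_model, base=10000.0):
    """Return a T x d_model table of sinusoidal position encodings.

    PE[pos][2i]   = sin(pos / base**(2i / d_model))
    PE[pos][2i+1] = cos(pos / base**(2i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(T)]
    for pos in range(T):
        for i in range(0, d_model, 2):      # i runs over the even column index 2i
            angle = pos / (base ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_pe(T=4, d_model=6)
# The table is added element-wise to the token embeddings of the same shape.
```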
The original Post-LayerNorm layout (LN after the residual) leads to large activation norms in deep stacks (>12 layers) and requires expensive learning-rate warmup and careful initialization tuning. Without warmup, training tends to diverge.
If the Q·K dot product is not divided by sqrt(d_k), softmax saturates for large d_k, gradient vanishes, and the model fails to learn.
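The saturation effect can be reproduced with random vectors. This is a small experiment under assumptions: unit-variance Gaussian components, d_k = 512, 16 keys, and a fixed seed. With such inputs the raw dot products have a standard deviation of about sqrt(d_k), so the unscaled softmax tends to collapse onto a single key.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)          # fixed seed so the run is repeatable
d_k, n_keys = 512, 16
q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
keys = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(n_keys)]

raw = [sum(a * b for a, b in zip(q, k)) for k in keys]
scaled = [s / math.sqrt(d_k) for s in raw]

p_raw, p_scaled = softmax(raw), softmax(scaled)
print(f"largest weight without scaling: {max(p_raw):.4f}")
print(f"largest weight with scaling:    {max(p_scaled):.4f}")
```

Dividing the scores by a constant greater than 1 always flattens the softmax, so the scaled distribution is never more peaked than the unscaled one.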
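A naive implementation allocates a T×T attention matrix in HBM for every head. For T=32K in fp16 that is 2 GiB per head per sequence (32,768² × 2 bytes), so a full layer needs that amount times the number of heads, quickly exhausting GPU memory at long contexts.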
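The arithmetic behind this figure is T² elements at 2 bytes each, counted for one head and one sequence. In the sketch below the helper name and the head count of 32 are illustrative assumptions.

```python
GiB = 1024 ** 3

def attn_matrix_bytes(T, bytes_per_elem=2, n_heads=1, batch=1):
    """Bytes needed to materialize the full T x T score matrix."""
    return batch * n_heads * T * T * bytes_per_elem

T = 32 * 1024
one_head = attn_matrix_bytes(T)                 # fp16, one head, one sequence
one_layer = attn_matrix_bytes(T, n_heads=32)    # hypothetical 32-head layer

print(one_head / GiB, "GiB per head")
print(one_layer / GiB, "GiB per layer with 32 heads")
```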
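During autoregressive inference, the KV-cache occupies O(L · T · d_model · 2) memory with standard multi-head attention; with grouped-query attention (GQA), d_model is replaced by n_kv_heads · d_head. For LLaMA-3-70B (GQA with 8 KV heads) at T=128K in fp16 this is ~40 GB per sequence, which limits batch size and throughput.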
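The estimate can be checked with a short calculation. The configuration values used here (80 layers, 64 query heads, 8 KV heads, head dimension 128) are assumptions taken from the publicly released LLaMA-3-70B configuration, and the cache is assumed to be stored in fp16.

```python
GiB = 1024 ** 3

def kv_cache_bytes(n_layers, T, n_kv_heads, head_dim, bytes_per_elem=2, batch=1):
    """Size of the KV-cache; the leading 2 counts keys and values separately."""
    return 2 * batch * n_layers * T * n_kv_heads * head_dim * bytes_per_elem

T = 128 * 1024
gqa = kv_cache_bytes(n_layers=80, T=T, n_kv_heads=8, head_dim=128)
mha = kv_cache_bytes(n_layers=80, T=T, n_kv_heads=64, head_dim=128)

print(gqa / GiB, "GiB with 8 KV heads (GQA)")
print(mha / GiB, "GiB if every query head kept its own K/V")
```

Under these assumptions the ~40 GB figure holds only because of grouped-query attention; caching one K/V pair per query head would cost eight times as much.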
A bug in the triangular mask (e.g., off-by-one, masking after softmax instead of before) causes future-token leakage during training — the model looks great in training but fails at inference.
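Both the correct order and the buggy one can be sketched side by side. The function names are hypothetical, and the all-zero score matrix is a toy input chosen so the weights are easy to read.

```python
import math

NEG_INF = float("-inf")

def causal_mask(T):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask):
    """Correct order: set disallowed scores to -inf BEFORE the softmax."""
    out = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else NEG_INF for s, ok in zip(row, allowed)]
        m = max(masked)                      # finite: the diagonal is always allowed
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def mask_after_softmax(scores, mask):
    """Buggy order, for contrast: normalize over all keys, zero out afterwards."""
    out = []
    for row, allowed in zip(scores, mask):
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)                    # denominator still sees future scores
        out.append([e / total if ok else 0.0 for e, ok in zip(exps, allowed)])
    return out

scores = [[0.0] * 4 for _ in range(4)]
good = masked_softmax(scores, causal_mask(4))
bad = mask_after_softmax(scores, causal_mask(4))
```

In the buggy variant the rows no longer sum to 1 and the normalizer depends on scores of future positions. A cheap regression test is to perturb a future token and assert that the outputs at earlier positions do not change.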
BERT (Post-LN) and GPT-2/LLaMA (Pre-LN) weights are not interchangeable in the same code — loading weights into the wrong layout produces garbage predictions without a runtime error.
Bahdanau et al. introduce soft attention in RNN-based neural machine translation — a precursor to self-attention.
Vaswani et al. publish the Transformer architecture, eliminating recurrence in favor of pure multi-head self-attention.
Devlin et al. (BERT, encoder-only) and Radford et al. (GPT, decoder-only) establish Transformer pre-training on massive corpora followed by fine-tuning as the dominant NLP paradigm.
Brown et al. (GPT-3, 175B) and Kaplan et al. (scaling laws) demonstrate predictable performance scaling with parameters, data, and compute.
Dosovitskiy et al. (ViT) show that a Transformer operating on image patches matches or outperforms strong CNNs on ImageNet when pre-trained on large datasets, marking the start of Transformer expansion beyond NLP.
Dao et al. introduce I/O-aware exact attention, reducing HBM↔SRAM data movement and making long-context training practical.
Mistral AI releases Mixtral — an open-weight MoE Transformer matching GPT-3.5 at a fraction of inference cost.
OpenAI o1 and DeepSeek-R1 introduce test-time compute scaling via long chain-of-thought reasoning on top of Transformer models, opening a new axis of scaling.
Time complexity: O(T² · d_model) per layer (attention) + O(T · d_model · d_ff) (FFN). Space complexity: O(T² + T·d_model) per layer (naive) → O(T·d_model) with FlashAttention.
Standard self-attention scales as O(T²) with respect to sequence length, becoming the dominant cost for long contexts (>8K tokens). Memory-wise, the T×T attention matrix becomes infeasible to materialize without tile-based techniques (FlashAttention).
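A rough per-layer FLOP count makes the crossover concrete. The sketch counts one multiply-add as 2 FLOPs and ignores softmax, normalization, and biases; the function name and the model sizes are illustrative assumptions.

```python
def layer_flops(T, d_model, d_ff):
    """Approximate forward-pass FLOPs of one Transformer layer."""
    attn_scores = 2 * T * T * d_model             # QK^T, summed over heads
    attn_values = 2 * T * T * d_model             # attention weights times V
    projections = 4 * 2 * T * d_model * d_model   # W_Q, W_K, W_V, W_O
    ffn = 2 * 2 * T * d_model * d_ff              # two linear layers
    return {
        "attention_quadratic": attn_scores + attn_values,
        "projections": projections,
        "ffn": ffn,
    }

d_model, d_ff = 1024, 4096
for T in (1024, 4096, 16384):
    f = layer_flops(T, d_model, d_ff)
    print(T, f["attention_quadratic"] / f["ffn"])
```

Under this count the quadratic attention term equals the FFN term at T = d_ff and grows linearly relative to it beyond that point.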
The MoE variant (Switch Transformer, Mixtral) shifts the paradigm to conditional computation (a mixture of experts), but this is an extension of the classical Transformer, not the Transformer itself.
The classical Transformer activates all parameters for every token. Routing appears only in MoE-FFN variants, where a subset of parameters (experts) is conditionally activated.
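Conditional activation can be sketched with a toy top-k router. Everything here is an illustrative assumption: the function names, the scalar "experts", and the choice to renormalize the gate weights over the selected experts only. Production routers differ in details such as load-balancing losses and capacity limits.

```python
import math

def top_k_route(router_logits, k=2):
    """Select the k largest logits and softmax-normalize them among themselves."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    m = max(router_logits[e] for e in top)
    exps = [math.exp(router_logits[e] - m) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]

def moe_ffn(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for e, gate in top_k_route(router_logits, k):
        y = experts[e](x)                 # unselected experts are never evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

calls = []                                # records which experts actually ran

def make_expert(idx, scale):
    def expert(x):
        calls.append(idx)
        return [scale * a for a in x]
    return expert

experts = [make_expert(i, s) for i, s in enumerate((1.0, 2.0, 3.0, 4.0))]
logits = [0.1, 2.0, -1.0, 1.0]            # router output for one token (toy values)
routes = top_k_route(logits, k=2)
y = moe_ffn([1.0], logits, experts, k=2)
```

Only two of the four experts run for this token, which is the sense in which a subset of parameters is activated.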
Full parallelism over sequence positions was the main reason Transformers replaced RNNs — RNNs required sequential unrolling in time even during training.
Dense matmul (QK^T, attention·V, FFN) maps perfectly to Tensor Cores. NVIDIA H100/H200/B200 with FP8/FP16/BF16 are the de facto standard for Transformer training and inference.
Google TPU v4/v5/v6 were designed with Transformers in mind (PaLM, Gemini). XLA + JAX/Flax deliver high performance at large scale.
Inference of small Transformers (e.g., BERT-base) on CPUs with AVX-512/AMX is feasible (Intel oneDNN, llama.cpp), but latency grows at least linearly with sequence length, and the attention term grows quadratically.
FPGAs are used in specialized low-latency applications (HFT, edge), but HBM memory limits and toolchain constraints make them a niche choice.