Robots Atlas

Transformer

Self-attention mechanism replacing recurrence and convolutions, enabling parallel sequence processing and modeling of long-range dependencies.

Category
Abstraction level: Operation level

Applications: Language models (LLMs), machine translation, text classification, code generation, embeddings, vision (ViT), multimodality (CLIP), tabular foundation models (TabPFN), recommendation models, protein structure prediction (AlphaFold).

The input is tokenized and embedded into a d_model-dimensional space, then positional encoding is added. The sequence passes through a stack of identical blocks: each block contains multi-head self-attention (Query/Key/Value computed from the embeddings; Attention(Q,K,V) = softmax(QK^T/√d_k)·V) and a position-wise feed-forward network (hidden width typically 4·d_model). Residual connections and LayerNorm are applied around each sublayer. In the encoder-decoder variant, the decoder adds a cross-attention layer over the encoder output and uses causal masking in its self-attention.
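A minimal sketch of one such block in PyTorch, following the layout described above (Post-LN as in the original paper; dropout, masking, and the embedding/positional-encoding steps are omitted, and the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Illustrative encoder block: multi-head self-attention + position-wise FFN,
    with a residual connection and LayerNorm around each sublayer (Post-LN)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, T, d_model]
        attn_out, _ = self.attn(x, x, x)                  # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)                      # residual + LayerNorm
        x = self.norm2(x + self.ffn(x))                   # residual + LayerNorm
        return x

x = torch.randn(2, 16, 512)                               # [B=2, T=16, d_model=512]
print(TransformerBlock()(x).shape)                        # torch.Size([2, 16, 512])
```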

Modeling long-range dependencies in sequences with fully parallelizable training computations — something RNNs/LSTMs could not provide due to the sequential nature of their temporal propagation.

01

Multi-Head Self-Attention (MHSA)

Modeling contextual dependencies between sequence positions

Modular

Mechanism allowing every position in the sequence to attend to all other positions. The input is projected into three matrices: Query (Q), Key (K), Value (V). Attention weights are computed as softmax(QK^T/sqrt(d_k)) and then multiplied by V. Multiple 'heads' perform this operation in parallel in low-dimensional subspaces, and the results are concatenated and projected through W_O (a minimal sketch follows after the variant list below).

i/o
in
[B, T, d_model]: B = batch size, T = sequence length, d_model = representation dimension (e.g., 512, 4096).
out
[B, T, d_model]: sequence of contextual representations with the same shape as the input.
Variants: Scaled Dot-Product Attention · FlashAttention · Grouped-Query Attention (GQA) · Multi-Query Attention (MQA) · Sliding Window / Local Attention
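A minimal sketch of the core computation in plain PyTorch tensor operations (a single head is shown; multi-head attention runs this in parallel over subspaces before concatenation and the W_O projection; names and shapes are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: [B, T, d_k] tensors for a single head; returns [B, T, d_k]."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # [B, T, T] attention logits
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)               # each row sums to 1
    return weights @ v                                     # weighted sum of value vectors

B, T, d_k = 2, 8, 64
q, k, v = (torch.randn(B, T, d_k) for _ in range(3))
print(scaled_dot_product_attention(q, k, v).shape)        # torch.Size([2, 8, 64])
```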
02

FFN / MLP Block

Non-linear per-position transformation of representations; the main carrier of parametric capacity

Modular

Two-layer MLP (Linear → activation → Linear) applied independently to each position in the sequence. Originally: ReLU, hidden dimension d_ff = 4·d_model. In modern models often GLU/SwiGLU/GeGLU (LLaMA, PaLM).

i/o
in
[B, T, d_model]
out
[B, T, d_model]
Variants: Vanilla FFN (ReLU) · SwiGLU · Mixture of Experts (MoE) FFN
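A sketch of the SwiGLU variant listed above, in the style used by LLaMA-family models (a simplified, hypothetical module; real implementations differ in naming and in how d_ff is chosen):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN applied per position: down(silu(gate(x)) * up(x))."""

    def __init__(self, d_model: int = 512, d_ff: int = 1408):
        # d_ff is often chosen near (8/3)*d_model so the parameter count roughly
        # matches a 4*d_model ReLU FFN despite the third weight matrix.
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: [B, T, d_model]
        return self.down(F.silu(self.gate(x)) * self.up(x))
```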
03

LayerNorm / RMSNorm

Stabilization of gradients and activations in deep layer stacks

Modular

Per-token normalization of activations within a layer, crucial for training stability in deep Transformers. Originally Post-LN (after sub-block); modern models predominantly use Pre-LN (before sub-block) or RMSNorm (a simplified version without centering).

Variants: Post-LN · Pre-LN · RMSNorm
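A sketch of the RMSNorm variant listed above (per-token scaling by the root mean square, with no mean subtraction and no bias; simplified):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale each token vector by its root mean square; no mean subtraction, no bias."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: [B, T, d_model]
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight
```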
04

Residual / skip connections

Gradient and information propagation through deep stacks

Adding the sub-block input to its output (y = x + f(x)). Enables training of hundreds-of-layers models without vanishing gradients and provides a 'residual stream' — an interpretable inter-layer communication channel (Elhage et al. 2021).

05

Positional Encoding (PE)

Introducing sequence order into the permutation-invariant attention mechanism

Modular

Mechanism for injecting token position information into a model that lacks recurrence. Originally: additive sinusoidal PE. Today, relative positions dominate (RoPE — Rotary Position Embedding, ALiBi), which generalize to longer contexts.

Variants: Sinusoidal (absolute) · Learned absolute · RoPE (Rotary) · ALiBi
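A sketch of the original additive sinusoidal encoding (illustrative; note that RoPE and ALiBi instead act inside the attention computation rather than being added to the embeddings):

```python
import math
import torch

def sinusoidal_positional_encoding(T: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...); assumes even d_model."""
    position = torch.arange(T).unsqueeze(1)                # [T, 1]
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe   # added to the token embeddings: x = embed(tokens) + pe
```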
Time complexity

Per layer: O(T²·d_model) for standard full self-attention plus O(T·d_model·d_ff) for the FFN, where T = sequence length, d_model = model dimension, d_ff = FFN hidden dimension (typically 4·d_model).

For long contexts (T » d_model) the quadratic attention term dominates. FlashAttention does not change the asymptotic complexity but drastically reduces HBM↔SRAM data movement.
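A rough back-of-the-envelope comparison of the two terms (constant factors ignored; the numbers are illustrative, not measurements):

```python
# Per-layer cost, up to constant factors:
#   attention scores / weighted sum: ~ T^2 * d_model
#   FFN:                             ~ T * d_model * d_ff
d_model, d_ff = 4096, 4 * 4096

for T in (2_048, 8_192, 32_768, 131_072):
    attention = T * T * d_model
    ffn = T * d_model * d_ff
    print(f"T={T:>7}: attention/FFN cost ratio ~ {attention / ffn:.1f}")
# The ratio is T / d_ff, so attention overtakes the FFN roughly once T exceeds 4 * d_model.
```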

Memory complexity

The T×T attention matrix must be materialized in the naive implementation. KV-cache during autoregressive inference: O(L · T · d_model) where L = number of layers.

KV-cache is the main memory bottleneck in long-context LLM inference — hence MQA/GQA, paged attention (vLLM).
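A back-of-the-envelope sketch of KV-cache size (model dimensions are illustrative assumptions, roughly in the range of a 70B-class model; GQA caches a reduced K/V width):

```python
def kv_cache_bytes(n_layers: int, seq_len: int, d_kv: int, bytes_per_elem: int = 2) -> int:
    """2x (K and V) per layer per token; fp16/bf16 = 2 bytes per element."""
    return 2 * n_layers * seq_len * d_kv * bytes_per_elem

# Full-width MHA cache (d_kv = d_model) vs. a GQA-style reduced K/V width.
full = kv_cache_bytes(n_layers=80, seq_len=128_000, d_kv=8192)
gqa = kv_cache_bytes(n_layers=80, seq_len=128_000, d_kv=1024)   # e.g. 8 KV heads x head_dim 128
print(f"MHA: {full / 1e9:.0f} GB, GQA: {gqa / 1e9:.0f} GB per sequence")
```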

Bottleneck: Quadratic attention complexity

Standard self-attention scales as O(T²) with respect to sequence length, becoming the dominant cost for long contexts (>8K tokens). Memory-wise, the T×T attention matrix becomes infeasible to materialize without tile-based techniques (FlashAttention).

Parallelism

Fully parallel

Full parallelism over sequence positions was the main reason Transformers replaced RNNs — RNNs required sequential unrolling in time even during training.

Paradigm

Dense

All paths active

The MoE variant (Switch Transformer, Mixtral) shifts the paradigm to conditional/mixture, but this is an extension of the classical Transformer, not the Transformer itself.

Strengths

  • Fully parallelizable training (unlike RNNs).
  • Models long-range dependencies in O(1) sequential steps (constant path length between any two positions).
  • Excellent scaling with parameter count and data (scaling laws).
  • Transferable representations for many downstream tasks.
  • Modality-agnostic architecture.

Limitations

  • Quadratic memory and time complexity with respect to sequence length (O(T²)).
  • High VRAM/HBM requirements for long contexts.
  • No built-in translation invariance (must be learned from data).
  • Training requires massive datasets and compute.

Common pitfalls

Post-LN instability in deep models
HIGH

The original Post-LayerNorm layout (LN after the residual) leads to large activation norms in deep stacks (>12 layers) and requires expensive learning-rate warmup and careful initialization tuning. Without it, training diverges.

Use Pre-LN or RMSNorm. Most modern Transformers (GPT-2+, LLaMA, PaLM) use Pre-LN.

Missing QK^T scaling by sqrt(d_k)
CRITICAL

If the Q·K^T logits are not divided by sqrt(d_k), the softmax saturates for large d_k, gradients vanish, and the model fails to learn.

Always divide attention logits by sqrt(d_k) before softmax — this is not an optional optimization but a numerical necessity.
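A quick numerical illustration of why the scaling matters: with unit-variance Q and K, the raw logits have variance roughly d_k, so the unscaled softmax collapses to near one-hot rows (toy example, values vary with the seed):

```python
import torch

torch.manual_seed(0)
d_k, T = 512, 32
q, k = torch.randn(T, d_k), torch.randn(T, d_k)

scaled = torch.softmax(q @ k.T / d_k ** 0.5, dim=-1)
unscaled = torch.softmax(q @ k.T, dim=-1)

# Mean entropy of the attention rows (higher = softer distribution = more gradient signal).
def entropy(p: torch.Tensor) -> float:
    return float(-(p * p.clamp_min(1e-12).log()).sum(-1).mean())

print(f"scaled:   entropy ~ {entropy(scaled):.2f}")    # relatively high (soft distribution)
print(f"unscaled: entropy ~ {entropy(unscaled):.2f}")  # close to 0 (saturated, near one-hot)
```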

Materializing full T×T attention matrix
HIGH

A naive implementation allocates a T×T matrix in HBM, which for T=32K in fp16 requires ~2 GB per attention head per sequence — quickly exhausting GPU memory at long contexts.

Use FlashAttention (PyTorch SDPA backend, xFormers, Triton kernels) — tile-based, never materializes the full matrix.
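A minimal usage sketch of PyTorch's built-in scaled_dot_product_attention, which dispatches to a FlashAttention-style fused kernel when one is available for the given device, dtype, and shapes:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

B, H, T, d_head = 1, 8, 4096, 64
q, k, v = (torch.randn(B, H, T, d_head, device=device, dtype=dtype) for _ in range(3))

# Fused attention: scores are computed tile by tile; the full TxT matrix is never materialized.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # [B, H, T, d_head]
```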

KV-cache grows linearly with context length
HIGH

During autoregressive inference, the KV-cache occupies O(L · T · d_model · 2) memory. For LLaMA-3-70B at T=128K this is ~40 GB — limiting batch size and throughput.

GQA/MQA (K/V head reduction), paged attention (vLLM), KV-cache quantization (Q4/Q8), sliding window attention.

Incorrect causal mask in the decoder
CRITICAL

A bug in the triangular mask (e.g., off-by-one, masking after softmax instead of before) causes future-token leakage during training — the model looks great in training but fails at inference.

Use built-in implementations (PyTorch nn.Transformer, HF transformers). Write tests verifying that prediction at t does not depend on tokens >t.
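A sketch of such a test, assuming a hypothetical model callable that maps token IDs [B, T] to logits [B, T, vocab]; perturbing tokens after position t must not change the logits at positions up to t:

```python
import torch

def check_causality(model, vocab_size: int = 100, T: int = 16, t: int = 5) -> None:
    """Logits at positions <= t must be identical when tokens after t are changed."""
    model.eval()
    tokens = torch.randint(0, vocab_size, (1, T))
    perturbed = tokens.clone()
    perturbed[0, t + 1:] = torch.randint(0, vocab_size, (T - t - 1,))
    with torch.no_grad():
        a, b = model(tokens), model(perturbed)                  # each [1, T, vocab_size]
    assert torch.allclose(a[:, : t + 1], b[:, : t + 1], atol=1e-5), "future-token leakage detected"
```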

Confusing Pre-LN and Post-LN when loading weights
MEDIUM

BERT (Post-LN) and GPT-2/LLaMA (Pre-LN) weights are not interchangeable in the same code — loading weights into the wrong layout produces garbage predictions without a runtime error.

Always check the model's config.json and use the appropriate class from the transformers library.

GENESIS · Source paper

Attention Is All You Need
2017 · NeurIPS · Ashish Vaswani, Noam Shazeer, Niki Parmar et al.
2014

Attention mechanism in seq2seq

Bahdanau et al. introduce soft attention in RNN-based neural machine translation — a precursor to self-attention.

2017

Attention Is All You Need

breakthrough

Vaswani et al. publish the Transformer architecture, eliminating recurrence in favor of pure multi-head self-attention.

2018

BERT and GPT-1

breakthrough

Devlin et al. (BERT, encoder-only) and Radford et al. (GPT, decoder-only) establish Transformer pre-training on massive corpora followed by fine-tuning as the dominant NLP paradigm.

2020

GPT-3 and scaling laws

breakthrough

Brown et al. (GPT-3, 175B) and Kaplan et al. (scaling laws) demonstrate predictable performance scaling with parameters, data, and compute.

2020

Vision Transformer (ViT)

breakthrough

Dosovitskiy et al. show that a Transformer operating on image patches outperforms CNNs on ImageNet — the start of Transformer expansion beyond NLP.

2022

FlashAttention

Dao et al. introduce I/O-aware exact attention, reducing HBM↔SRAM data movement and making long-context training practical.

2023

Mixture-of-Experts mainstream (Mixtral 8x7B)

Mistral AI releases Mixtral — an open-weight MoE Transformer matching GPT-3.5 at a fraction of inference cost.

2024

Reasoning models (o1, DeepSeek-R1)

breakthrough

OpenAI o1 and DeepSeek-R1 introduce test-time compute scaling via long chain-of-thought reasoning built on the Transformer — a new axis of scaling.

GPU Tensor Cores · PRIMARY

Dense matmul (QK^T, attention·V, FFN) maps perfectly to Tensor Cores. NVIDIA H100/H200/B200 with FP8/FP16/BF16 are the de facto standard for Transformer training and inference.

TPUPRIMARY

Google TPU v4/v5/v6 were designed with Transformers in mind (PaLM, Gemini). XLA + JAX/Flax deliver high performance at large scale.

CPU AVX · LIMITED

Inference of small Transformers (e.g., BERT-base) on CPU AVX-512/AMX is feasible (Intel oneDNN, llama.cpp), but latency scales linearly with sequence length.

FPGA · POSSIBLE

FPGAs are used in specialized low-latency applications (HFT, edge), but HBM memory limits and toolchain constraints make them a niche choice.

Related AI models

TabPFN (1) · Other (1)