Transformer

Self-attention mechanism replacing recurrence and convolutions, enabling parallel sequence processing and modeling of long-range dependencies.

Multi-Head Self-Attention (MHSA)

Modeling contextual dependencies between sequence positions

Modular

Mechanism allowing every position in the sequence to attend to all positions, including itself. The input is projected into three matrices: Query (Q), Key (K), Value (V). Attention weights are computed as softmax(QK^T/sqrt(d_k)); the output is these weights multiplied by V. Multiple 'heads' perform this operation in parallel in lower-dimensional subspaces, and the results are concatenated and projected through W_O.

i/o

[B, T, d_model]: B = batch, T = sequence length, d_model = representation dimension (e.g., 512, 4096).

out

[B, T, d_model]: sequence of contextual representations with the same shape as the input.
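The projection, scaling, softmax, and concatenation steps described above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name, the random weights, and the toy sizes (B=2, T=5, d_model=16, 4 heads) are assumptions chosen for the example, and it assumes d_model is divisible by the number of heads.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: [B, T, d_model]; w_q, w_k, w_v, w_o: [d_model, d_model]."""
    B, T, d_model = x.shape
    d_k = d_model // n_heads  # per-head dimension (assumes exact division)

    def split_heads(t):
        # [B, T, d_model] -> [B, n_heads, T, d_k]
        return t.reshape(B, T, n_heads, d_k).transpose(0, 2, 1, 3)

    q, k, v = (split_heads(x @ w) for w in (w_q, w_k, w_v))
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_k)  # [B, n_heads, T, T]
    weights = softmax(scores, axis=-1)                    # attention weights
    heads = weights @ v                                   # [B, n_heads, T, d_k]
    concat = heads.transpose(0, 2, 1, 3).reshape(B, T, d_model)
    return concat @ w_o, weights                          # output projection W_O

# Toy sizes and random weights, purely illustrative.
rng = np.random.default_rng(0)
B, T, d_model, n_heads = 2, 5, 16, 4
x = rng.normal(size=(B, T, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape, weights.shape)
```

The output keeps the input shape [B, T, d_model], and each row of the attention weights sums to 1.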

FFN / MLP Block

Non-linear per-position transformation of representations; the main carrier of parametric capacity

Modular

Two-layer MLP (Linear → activation → Linear) applied independently to each position in the sequence. Originally: ReLU, hidden dimension d_ff = 4·d_model. Modern models often use gated variants (GLU/SwiGLU/GeGLU; e.g., LLaMA, PaLM).

i/o

[B, T, d_model]

out

[B, T, d_model]
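A minimal NumPy sketch of the block, assuming random weights and toy sizes. `ffn_relu` follows the original Linear → ReLU → Linear layout; `swiglu_ffn` is an illustrative gated variant in the SwiGLU style (no biases), and its weight names are hypothetical. The final check illustrates that each position is transformed independently.

```python
import numpy as np

def ffn_relu(x, w1, b1, w2, b2):
    """Linear -> ReLU -> Linear on the last axis of x: [B, T, d_model]."""
    hidden = np.maximum(x @ w1 + b1, 0.0)  # [B, T, d_ff]
    return hidden @ w2 + b2                # [B, T, d_model]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated variant: (SiLU(x W_gate) * (x W_up)) W_down."""
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))    # SiLU(z) = z * sigmoid(z)
    return (silu * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
B, T, d_model = 2, 6, 8
d_ff = 4 * d_model
w1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)
w_gate = rng.normal(size=(d_model, d_ff)) * 0.1
w_up = rng.normal(size=(d_model, d_ff)) * 0.1

x = rng.normal(size=(B, T, d_model))
y = ffn_relu(x, w1, b1, w2, b2)
y_gated = swiglu_ffn(x, w_gate, w_up, w2)

# Position independence: feeding one position alone reproduces its row.
y_single = ffn_relu(x[:, 3:4, :], w1, b1, w2, b2)
print(y.shape, y_gated.shape, np.allclose(y[:, 3:4, :], y_single))
```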

LayerNorm / RMSNorm

Stabilization of gradients and activations in deep layer stacks

Modular

Per-token normalization of activations within a layer, crucial for training stability in deep Transformers. Originally Post-LN (after sub-block); modern models predominantly use Pre-LN (before sub-block), often with RMSNorm (a simplified variant of LayerNorm without mean centering).
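The difference between the two normalizations can be sketched as follows. The function names, the epsilon value, and the toy input are assumptions for illustration; both functions normalize over the last (feature) axis, one token at a time.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Center and scale each token's feature vector, then apply gain and bias.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # Scale by the root mean square only: no centering and no bias.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 8))  # deliberately non-zero mean
gamma, beta = np.ones(8), np.zeros(8)

ln = layer_norm(x, gamma, beta)
rn = rms_norm(x, gamma)
print(np.abs(ln.mean(axis=-1)).max())   # close to zero: LayerNorm centers
print(np.abs(rn.mean(axis=-1)).min())   # not zero: RMSNorm keeps the mean
```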

Residual / skip connections

Gradient and information propagation through deep stacks

Adds the sub-block input to its output (y = x + f(x)). Enables training of models with hundreds of layers without vanishing gradients and provides a 'residual stream': an interpretable inter-layer communication channel (Elhage et al. 2021).

Positional Encoding (PE)

Introducing sequence order into the permutation-equivariant attention mechanism

Modular

Mechanism for injecting token position information into a model that lacks recurrence. Originally: additive sinusoidal PE. Today, relative position schemes dominate (RoPE, i.e. Rotary Position Embedding, and ALiBi), which tend to generalize better to longer contexts.
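As an illustration of the original additive scheme, the sketch below builds the sinusoidal table from Vaswani et al. (sine on even feature indices, cosine on odd ones, base 10000). The function name and toy sizes are assumptions, and d_model is assumed to be even. RoPE and ALiBi work differently and are not shown.

```python
import numpy as np

def sinusoidal_positional_encoding(T, d_model):
    """Return a [T, d_model] table; assumes d_model is even."""
    positions = np.arange(T)[:, None]             # [T, 1]
    dims = np.arange(0, d_model, 2)[None, :]      # even feature indices 2i
    angles = positions / (10000.0 ** (dims / d_model))
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)                  # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                  # PE(pos, 2i + 1)
    return pe

T, d_model = 10, 16
pe = sinusoidal_positional_encoding(T, d_model)

# Additive use: broadcast the table over the batch dimension.
x = np.zeros((2, T, d_model))
x_with_position = x + pe[None, :, :]
print(pe.shape, x_with_position.shape)
```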

Time

…

Standard full self-attention, sequence length T, model dimension d_model, FFN hidden dimension d_ff (typically 4·d_model).

For long contexts (T ≫ d_model) the quadratic attention term dominates. FlashAttention does not change the asymptotic complexity but drastically reduces HBM↔SRAM data movement.
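A rough back-of-the-envelope sketch of when the quadratic term takes over. It counts multiply-accumulates per layer in the forward pass and deliberately ignores softmax, normalization, biases, and masking, so the numbers are only indicative. The function name and the example sizes are assumptions for illustration.

```python
def layer_macs(T, d_model, d_ff=None):
    """Approximate multiply-accumulate counts for one Transformer layer."""
    d_ff = 4 * d_model if d_ff is None else d_ff
    projections = 4 * T * d_model * d_model  # Q, K, V and output projections
    attention = 2 * T * T * d_model          # QK^T plus weights @ V (quadratic in T)
    ffn = 2 * T * d_model * d_ff             # two linear layers
    return {"projections": projections, "attention": attention, "ffn": ffn}

for T in (512, 4096, 32768):
    m = layer_macs(T, d_model=512)
    quad_share = m["attention"] / sum(m.values())
    print(T, round(quad_share, 3))
# 512 0.143
# 4096 0.571
# 32768 0.914
```

Under these simplifications and d_ff = 4·d_model, the quadratic term equals the linear terms at T = 6·d_model and dominates beyond that.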

Memory complexity

…

The T×T attention matrix must be materialized in the naive implementation. KV-cache during autoregressive inference: O(L · T · d_model) where L = number of layers.

KV-cache is the main memory bottleneck in long-context LLM inference — hence MQA/GQA, paged attention (vLLM).

Bottleneck: Quadratic attention complexity

Standard self-attention scales as O(T²) with respect to sequence length, becoming the dominant cost for long contexts (>8K tokens). Memory-wise, the T×T attention matrix becomes infeasible to materialize without tile-based techniques (FlashAttention).
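The tiling idea can be illustrated with a blockwise softmax that keeps running statistics instead of the full T×T matrix. This is a simplified NumPy sketch of the online-softmax trick only: it tiles over keys but not queries, runs on the CPU, and is not FlashAttention itself. Function names, the block size, and the toy shapes are assumptions.

```python
import numpy as np

def naive_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])   # full [T, T] matrix is materialized
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def blockwise_attention(q, k, v, block=4):
    T, d = q.shape
    scale = 1.0 / np.sqrt(d)
    running_max = np.full((T, 1), -np.inf)    # row-wise max seen so far
    running_sum = np.zeros((T, 1))            # row-wise softmax denominator
    acc = np.zeros((T, v.shape[-1]))          # unnormalized output
    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        s = q @ k_blk.T * scale               # only a [T, block] tile at a time
        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        correction = np.exp(running_max - new_max)  # rescale earlier statistics
        p = np.exp(s - new_max)
        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ v_blk
        running_max = new_max
    return acc / running_sum

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(10, 8)) for _ in range(3))
print(np.allclose(naive_attention(q, k, v), blockwise_attention(q, k, v)))
```

Both functions compute the same exact attention output; only the peak size of the intermediate score buffer differs.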

Parallelism

Fully parallel

Full parallelism over sequence positions was the main reason Transformers replaced RNNs — RNNs required sequential unrolling in time even during training.

Paradigm

Dense

All paths active

The MoE variant (Switch Transformer, Mixtral) shifts the paradigm to conditional/mixture, but this is an extension of the classical Transformer, not the Transformer itself.

Strengths

Fully parallelizable training (unlike RNNs). Models long-range dependencies in O(1) steps. Excellent scaling with parameter count and data (scaling laws). Transferable representations to many downstream tasks. Modality-agnostic architecture.

Limitations

Quadratic memory and time complexity with respect to sequence length (O(T²)). High VRAM/HBM requirements for long contexts. No built-in translation invariance (must be learned from data). Training requires massive datasets and compute.

Common pitfalls

Post-LN instability in deep models

HIGH

The original Post-LayerNorm layout (LN applied after the residual addition) leads to large activation norms in deep stacks (>12 layers) and requires expensive learning-rate warmup and careful initialization tuning. Without warmup and tuned initialization, training tends to diverge.

Use Pre-LN, optionally with RMSNorm in place of LayerNorm. Most modern Transformers (GPT-2+, LLaMA, PaLM) use Pre-LN.
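The two layouts differ only in where the normalization sits relative to the residual addition. A minimal sketch, with hypothetical function names and a parameter-free LayerNorm for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Original layout: normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN layout: normalize the sub-block input; the residual path is untouched.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, size=(2, 4, 8))
w = rng.normal(size=(8, 8)) * 0.1
sublayer = lambda h: h @ w                 # stand-in for attention or FFN

print(post_ln_block(x, sublayer).shape, pre_ln_block(x, sublayer).shape)

# With a sub-block that outputs zeros, Pre-LN is an exact identity on x,
# whereas Post-LN still rescales the residual stream.
zero = lambda h: np.zeros_like(h)
print(np.allclose(pre_ln_block(x, zero), x), np.allclose(post_ln_block(x, zero), x))
```

Because the two layouts compute different functions with the same parameter shapes, weights trained for one cannot simply be reused in the other.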

Missing QK^T scaling by sqrt(d_k)

CRITICAL

If the Q·K dot product is not divided by sqrt(d_k), softmax saturates for large d_k, gradients vanish, and the model fails to learn.

Always divide attention logits by sqrt(d_k) before softmax — this is not an optional optimization, it is a numerical foundation.
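The saturation effect is easy to observe numerically. The sketch below draws random queries and keys with unit-variance entries (an assumption that roughly mimics initialization), so unscaled logits have a standard deviation near sqrt(d_k). The sizes are illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k, n_queries, n_keys = 512, 100, 64
q = rng.normal(size=(n_queries, d_k))
k = rng.normal(size=(n_keys, d_k))

logits = q @ k.T                                   # std of roughly sqrt(d_k)
# Average of the largest attention weight per query.
peak_unscaled = softmax(logits).max(axis=-1).mean()
peak_scaled = softmax(logits / np.sqrt(d_k)).max(axis=-1).mean()
print(peak_unscaled, peak_scaled)
```

Without scaling, almost all probability mass lands on a single key (a near one-hot softmax, where gradients are close to zero). With scaling, the distribution stays spread out.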

Materializing full T×T attention matrix

HIGH

A naive implementation allocates a T×T matrix in HBM, which for T=32K and fp16 requires 2 GiB per attention head and sequence, quickly exhausting GPU memory at long contexts.

Use FlashAttention (PyTorch SDPA backend, xFormers, Triton kernels) — tile-based, never materializes the full matrix.
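The memory figure can be checked with simple arithmetic. In this sketch the helper name and the 32-head example are assumptions. The point is that 2 GiB at T=32K in fp16 is the cost of a single T×T matrix, that is, one head of one sequence.

```python
def attention_matrix_bytes(T, bytes_per_element=2, n_heads=1, batch=1):
    # One T x T score matrix per head and per sequence in the batch.
    return batch * n_heads * T * T * bytes_per_element

GiB = 1024 ** 3
T = 32 * 1024

per_head = attention_matrix_bytes(T)                 # fp16, one head, one sequence
per_layer = attention_matrix_bytes(T, n_heads=32)    # hypothetical 32-head layer
print(per_head / GiB)    # 2.0
print(per_layer / GiB)   # 64.0
```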

KV-cache grows linearly with context length

HIGH

During autoregressive inference, the KV-cache occupies O(L · T · d_model · 2) memory (the factor 2 covers keys and values). For LLaMA-3-70B at T=128K this is ~40 GB per sequence in fp16, even with GQA, which limits batch size and throughput.

GQA/MQA (K/V head reduction), paged attention (vLLM), KV-cache quantization (Q4/Q8), sliding window attention.
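A small calculator makes the effect of K/V head reduction concrete. The configuration used here (80 layers, 64 query heads, 8 KV heads, head dimension 128, fp16) is an assumption about LLaMA-3-70B and should be checked against the model's published config. The helper name is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, T, bytes_per_element=2, batch=1):
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * T * bytes_per_element * batch

GiB = 1024 ** 3
T = 128 * 1024

# Assumed configuration: 8 KV heads with GQA, versus caching all 64 heads.
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, T=T)
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, T=T)
print(gqa / GiB, mha / GiB)  # 40.0 320.0
```

Under these assumptions, grouping 64 query heads onto 8 KV heads shrinks the cache by a factor of 8 per sequence.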

Incorrect causal mask in the decoder

CRITICAL

A bug in the triangular mask (e.g., off-by-one, masking after softmax instead of before) causes future-token leakage during training — the model looks great in training but fails at inference.

Use built-in implementations (PyTorch nn.Transformer, HF transformers). Write tests verifying that prediction at t does not depend on tokens >t.
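A leakage test of the kind suggested above might look like this sketch. It uses a single-head NumPy attention with the mask applied before softmax, then checks that outputs up to position t do not change when later tokens are replaced. All names and sizes are assumptions for illustration.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    T, d = x.shape[0], w_q.shape[1]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where key index > query index
    scores = np.where(future, -np.inf, scores)           # mask BEFORE softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)                                    # masked entries become exactly 0
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def leaks_future(attn_fn, x, t, rng):
    """True if outputs at positions <= t change when tokens > t are replaced."""
    x_perturbed = x.copy()
    x_perturbed[t + 1:] = rng.normal(size=x_perturbed[t + 1:].shape)
    return not np.allclose(attn_fn(x)[: t + 1], attn_fn(x_perturbed)[: t + 1])

rng = np.random.default_rng(0)
T, d_model = 6, 8
x = rng.normal(size=(T, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
attn = lambda h: causal_self_attention(h, w_q, w_k, w_v)

print(any(leaks_future(attn, x, t, rng) for t in range(T - 1)))  # False
```

The same check can be pointed at any attention function that takes a [T, d_model] array, which makes it usable as a regression test for a custom mask.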

Confusing Pre-LN and Post-LN when loading weights

MEDIUM

BERT (Post-LN) and GPT-2/LLaMA (Pre-LN) weights are not interchangeable in the same code — loading weights into the wrong layout produces garbage predictions without a runtime error.

Always check the model's config.json and use the appropriate class from the transformers library.

GENESIS · Source paper

Attention Is All You Need

2017 · NeurIPS 2017 · Ashish Vaswani, Noam Shazeer, Niki Parmar et al.

2014

Attention mechanism in seq2seq

Bahdanau et al. introduce soft attention in RNN-based neural machine translation — a precursor to self-attention.

Neural Machine Translation by Jointly Learning to Align and Translate

2017

Attention Is All You Need

breakthrough

Vaswani et al. publish the Transformer architecture, eliminating recurrence in favor of pure multi-head self-attention.

Attention Is All You Need

2018

BERT and GPT-1

breakthrough

Devlin et al. (BERT, encoder-only) and Radford et al. (GPT, decoder-only) establish Transformer pre-training on massive corpora followed by fine-tuning as the dominant NLP paradigm.

2020

GPT-3 and scaling laws

breakthrough

Brown et al. (GPT-3, 175B) and Kaplan et al. (scaling laws) demonstrate predictable performance scaling with parameters, data, and compute.

Language Models are Few-Shot Learners

2020

Vision Transformer (ViT)

breakthrough

Dosovitskiy et al. show that a Transformer operating on image patches matches or outperforms state-of-the-art CNNs on ImageNet when pre-trained on sufficiently large datasets, marking the start of Transformer expansion beyond NLP.

An Image is Worth 16x16 Words

2022

FlashAttention

Dao et al. introduce I/O-aware exact attention, reducing HBM↔SRAM data movement and making long-context training practical.

FlashAttention: Fast and Memory-Efficient Exact Attention

2023

Mixture-of-Experts mainstream (Mixtral 8x7B)

Mistral AI releases Mixtral, an open-weight MoE Transformer reported to match GPT-3.5 on most benchmarks at a fraction of the inference cost.

2024

Reasoning models (o1, DeepSeek-R1)

breakthrough

OpenAI o1 (2024) and DeepSeek-R1 (released in early 2025) introduce test-time compute scaling via long chain-of-thought built on the Transformer, opening a new axis of scaling.

GPU Tensor Cores

PRIMARY

Dense matmul (QK^T, attention·V, FFN) maps perfectly to Tensor Cores. NVIDIA H100/H200/B200 with FP8/FP16/BF16 are the de facto standard for Transformer training and inference.

TPU

PRIMARY

Google TPU v4/v5/v6 were designed with Transformers in mind (PaLM, Gemini). XLA + JAX/Flax deliver high performance at large scale.

CPU AVX

LIMITED

Inference of small Transformers (e.g., BERT-base) on CPU AVX-512/AMX is feasible (Intel oneDNN, llama.cpp), but latency grows at least linearly with sequence length.

FPGA

POSSIBLE

FPGAs are used in specialized low-latency applications (HFT, edge), but HBM memory limits and toolchain constraints make them a niche choice.

Related AI models

TabPFN

Other

Title	Publisher	Type
Attention Is All You Need (Vaswani et al., 2017)	—	scientific article
The Illustrated Transformer (Jay Alammar)	—	blog
The Annotated Transformer (Harvard NLP)	—	documentation
