Inference

FlashAttention

2022 · Active · Published: 29 May 2026 · Updated: 29 May 2026

Key innovation

An IO-aware algorithm for computing exact self-attention that minimizes transfers between GPU HBM and on-chip SRAM via tiling and online softmax. It speeds up attention by 2-4× and reduces attention memory from O(n²) to O(n) in the sequence length n, without approximation.

How it works

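The algorithm splits Q into blocks of size B_r × d and K, V into blocks of size B_c × d, chosen so that the working set fits in SRAM (typically 100-200 kB per SM). Each Q block is loaded once; the kernel then iterates over the K and V blocks, computing partial attention results and accumulating them with a numerically stable online softmax. It maintains a running row maximum m and a running sum of exponentials l, and for each new (K_j, V_j) pair updates O ← rescale(O_prev, m_old, m_new) + exp(S_new - m_new) · V_j, where S_new is the block of scaled scores between the current Q block and K_j. In the unnormalized formulation, rescale multiplies O_prev by exp(m_old - m_new), and the final output is O divided by l. The n×n attention matrix is never materialized in HBM. The backward pass recomputes attention blocks instead of reading a saved attention matrix, which is a form of gradient checkpointing.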
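The loop structure described above can be sketched in plain Python. This is a minimal illustration, not the CUDA kernel: the function names, the list-of-lists layout, and the tiny default block sizes are assumptions made for readability, and the accumulator is kept unnormalized until the end.

```python
import math

def naive_attention(Q, K, V):
    """Reference version: builds each full row of the n x n score matrix."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(s)
        p = [math.exp(x - m) for x in s]
        l = sum(p)
        out.append([sum(p[j] * V[j][c] for j in range(len(V))) / l
                    for c in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block_r=2, block_c=2):
    """Block-wise attention with online softmax (illustrative sketch)."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for i0 in range(0, len(Q), block_r):        # each Q block is visited once
        q_blk = Q[i0:i0 + block_r]
        m = [-math.inf] * len(q_blk)            # running row max
        l = [0.0] * len(q_blk)                  # running sum of exponentials
        acc = [[0.0] * dv for _ in q_blk]       # unnormalized output rows
        for j0 in range(0, len(K), block_c):    # stream over K/V blocks
            k_blk = K[j0:j0 + block_c]
            v_blk = V[j0:j0 + block_c]
            for r, q in enumerate(q_blk):
                s = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
                m_new = max(m[r], max(s))
                alpha = math.exp(m[r] - m_new)  # rescale factor; 0.0 on the first block
                p = [math.exp(x - m_new) for x in s]
                l[r] = alpha * l[r] + sum(p)
                acc[r] = [alpha * acc[r][c]
                          + sum(p[t] * v_blk[t][c] for t in range(len(v_blk)))
                          for c in range(dv)]
                m[r] = m_new
        # normalize once, after all K/V blocks have been seen
        out.extend([[a / l[r] for a in acc[r]] for r in range(len(q_blk))])
    return out
```

Only one B_r × B_c block of scores exists at any time, yet the result matches the naive version up to floating-point rounding.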

Problem solved

The standard attention implementation materializes the n×n matrix in HBM and is memory-bound: the dominant cost is data transfer between HBM and SRAM, not the softmax FLOPs. This limits maximum context length and throughput.
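To give a sense of scale, the following back-of-the-envelope sketch compares the size of a materialized score matrix with the per-row statistics that a tiled kernel keeps instead. The helper names are hypothetical, and the element sizes (2-byte scores, 4-byte statistics) are assumptions chosen for illustration.

```python
def score_matrix_bytes(n, bytes_per_element=2):
    """Bytes for one n x n score matrix (per head, per batch element)."""
    return n * n * bytes_per_element

def row_stats_bytes(n, bytes_per_element=4):
    """Bytes for one per-row statistic, e.g. a logsumexp value per query."""
    return n * bytes_per_element

n = 8192
matrix_mib = score_matrix_bytes(n) / 2**20   # 128.0 MiB
stats_kib = row_stats_bytes(n) / 2**10       # 32.0 KiB
```

The first quantity grows quadratically with n and the second linearly, which is the O(n²) versus O(n) difference in memory.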

Components

Q, K, V tiling

Splitting the Q, K, V matrices into blocks that fit in GPU SRAM (typically 64-128 rows per block, with a head dimension d of 64-128).
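A rough way to reason about block sizes is to add up the tiles that must be resident at once. The sketch below is an assumption-laden estimate, not how any real kernel budgets memory: it counts one Q tile, one K tile, one V tile, one output tile and one score tile, all at 2 bytes per element, against a hypothetical 100 KiB budget. Real kernels also use registers and higher-precision accumulators.

```python
def tile_bytes(block_r, block_c, d, bytes_per_element=2):
    """Estimated on-chip bytes for one step of the tiled loop."""
    q_tile = block_r * d
    k_tile = block_c * d
    v_tile = block_c * d
    o_tile = block_r * d
    s_tile = block_r * block_c          # scores for this block pair
    return (q_tile + k_tile + v_tile + o_tile + s_tile) * bytes_per_element

def fits_in_sram(block_r, block_c, d, sram_bytes=100 * 1024):
    """True if the estimated working set fits the assumed SRAM budget."""
    return tile_bytes(block_r, block_c, d) <= sram_bytes

small_head = fits_in_sram(128, 128, 64)    # 96 KiB estimate, fits
large_head = fits_in_sram(128, 128, 128)   # 160 KiB estimate, does not fit
```

Under these assumptions, doubling the head dimension pushes a 128 × 128 tiling over budget, so a larger d calls for smaller blocks.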

Online softmax

A numerically stable recurrence that maintains a running max and a running sum of exponentials, which allows softmax to be computed block-wise without materializing the full matrix.
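For a single row of scores, the recurrence can be sketched as follows. The function names are hypothetical; the point is that the running pair (m, l) can be updated one chunk at a time and still yields the same softmax as a single pass over the whole row.

```python
import math

def online_softmax_stats(chunks):
    """Stream over chunks of scores, tracking running max m and sum l."""
    m, l = -math.inf, 0.0
    for chunk in chunks:
        m_new = max(m, max(chunk))
        # rescale the old sum to the new max, then add the new chunk
        l = l * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in chunk)
        m = m_new
    return m, l

def softmax_from_stats(xs, m, l):
    """Softmax probabilities for xs given the final statistics."""
    return [math.exp(x - m) / l for x in xs]

chunks = [[1.0, 2.0], [3.0], [0.5, 1000.0]]
m, l = online_softmax_stats(chunks)
flat = [x for chunk in chunks for x in chunk]
probs = softmax_from_stats(flat, m, l)
```

Subtracting the running max before exponentiating keeps every exponent at or below zero, so a score of 1000.0 does not overflow.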

Backward-pass recomputation

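The backward pass does not save the attention matrix. It recomputes the attention blocks from Q, K and the saved row-wise logsumexp L, and uses the saved output O when forming the softmax gradient. This trades extra FLOPs for lower memory use.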
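The recomputation step relies on the identity that a softmax probability equals exp(score - logsumexp). The sketch below illustrates only that identity for one query row, under the assumption that the forward pass stored a single logsumexp value per row; the function names are hypothetical and no gradients are computed here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def row_logsumexp(q, K, scale):
    """What the forward pass would save for this query row."""
    s = [scale * dot(q, k) for k in K]
    m = max(s)
    return m + math.log(sum(math.exp(x - m) for x in s))

def recompute_probs(q, k_blk, lse, scale):
    """Rebuild one block of attention probabilities without the full row."""
    return [math.exp(scale * dot(q, k) - lse) for k in k_blk]

q = [0.3, -0.2]
K = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0], [0.0, 1.0]]
scale = 1.0 / math.sqrt(len(q))
lse = row_logsumexp(q, K, scale)
# two blocks recomputed independently, as a backward pass would do
probs = recompute_probs(q, K[:2], lse, scale) + recompute_probs(q, K[2:], lse, scale)
```

Each block is rebuilt from Q, K and one scalar per row, so the backward pass needs O(n) saved state rather than the n×n matrix.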

Implementation

Reference implementations

flash-attention (Dao-AILab)

CUDA / Python · Dao AI Lab (Tri Dao)

Official

PyTorch — scaled_dot_product_attention

C++ / CUDA / Python · PyTorch Foundation

Official

xFormers — memory-efficient attention

Python / CUDA · Meta AI

Official

Implementation pitfalls

Version vs hardware mismatch · Medium

FlashAttention-3 requires Hopper (H100/H200) and does not work on Ampere (A100), where v2 is the standard choice. Selecting a version the hardware does not support can forfeit the 2-4× speedup.

Fix: Detect the GPU architecture at runtime; use v2 on A100 and v3 on H100. PyTorch's scaled_dot_product_attention selects among its available backends automatically.
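One possible shape for the runtime check is a small function that maps a CUDA compute capability to a version label. This is a sketch under assumptions: the function name and the returned labels are hypothetical and belong to no library. The capability tuple is the (major, minor) pair that PyTorch's torch.cuda.get_device_capability() returns: (8, 0) for A100 and (9, 0) for H100.

```python
def pick_flash_attention_version(capability):
    """Map a (major, minor) compute capability to an illustrative label."""
    major, _minor = capability
    if major == 9:      # Hopper, e.g. H100/H200
        return "v3"
    if major == 8:      # Ampere, e.g. A100
        return "v2"
    return "fallback"   # anything else: defer to a generic attention path

a100_choice = pick_flash_attention_version((8, 0))
h100_choice = pick_flash_attention_version((9, 0))
```

Taking the tuple as an argument keeps the decision logic testable on machines without a GPU; the caller supplies the value queried from the device.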

Limited support for custom attention · Low

FlashAttention assumes standard scaled-dot-product attention with an optional causal mask. Non-standard biases or masks (e.g., ALiBi, block-sparse patterns) require special variants or prevent its use.

Fix: Check support for the bias/mask in use. FlexAttention (PyTorch) offers greater flexibility.