Architecture

SWA (Sliding Window Attention)

2020 · Active · Published: 9 June 2026 · Updated: 9 June 2026

Key innovation

Replaces full self-attention O(T²) with a fixed-width local window W: each token "sees" only W previous tokens (or W/2 on each side) instead of the whole sequence. The receptive field grows linearly with model depth (L layers × W), allowing sequences much longer than W at O(T·W·d) cost.

How it works

In standard self-attention the attention matrix has shape [T, T]: every query i can attend to any key j in [0, T-1] (or [0, i] in the causal variant). In SWA each query i can attend only to keys in the window [i-W+1, i] (causal) or [i-W/2, i+W/2] (symmetric). Positions outside the window are masked (logit = -∞ before softmax), so their attention weight is exactly zero.

The receptive field grows roughly linearly with depth: after L layers each token can be influenced by tokens up to approximately L · W positions back. For Mistral 7B (L=32, W=4096) that is a theoretical reach of about 131 072 tokens, even though a single layer looks at only 4096. This is an upper bound on how far information can propagate, not a guarantee that it is preserved.

In practice SWA is usually implemented with fused kernels such as FlashAttention: the full [T, T] matrix is never materialised, and each query is scored against at most W keys. In autoregressive inference the KV cache stores only the last W tokens per layer (as in Mistral), which reduces KV cache memory from linear in T to linear in min(T, W). Bidirectional encoders such as Longformer process the whole sequence in a single pass and have no KV cache to shrink; their saving is in attention compute and activation memory.
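The windowing rule can be sketched in plain Python. This is an illustrative, single-head toy under stated assumptions, not how a fused kernel is written: the function names `swa_causal` and `masked_full_causal` are hypothetical, and inputs are lists of float vectors. The first function scores each query against at most W keys; the second scores all T keys and masks with -∞, and the two should agree up to floating-point error.

```python
import math
import random

def _softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def swa_causal(q, k, v, window):
    """Causal SWA that scores each query against at most `window` keys."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        js = range(max(0, i - window + 1), i + 1)  # window [i-W+1, i], clipped at 0
        weights = _softmax([_dot(qi, k[j]) * scale for j in js])
        out.append([sum(w * v[j][c] for w, j in zip(weights, js))
                    for c in range(len(v[0]))])
    return out

def masked_full_causal(q, k, v, window):
    """Reference: all T keys scored, logits outside the window set to -inf."""
    T = len(q)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i in range(T):
        logits = [_dot(q[i], k[j]) * scale if i - window + 1 <= j <= i else -math.inf
                  for j in range(T)]
        weights = _softmax(logits)  # exp(-inf) == 0.0, so masked keys contribute nothing
        out.append([sum(weights[j] * v[j][c] for j in range(T))
                    for c in range(len(v[0]))])
    return out

# Small random example (sizes chosen arbitrarily for illustration).
rng = random.Random(0)
T, d, W = 10, 4, 3
q, k, v = ([[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)] for _ in range(3))
windowed = swa_causal(q, k, v, W)
reference = masked_full_causal(q, k, v, W)
max_diff = max(abs(a - b) for ra, rb in zip(windowed, reference) for a, b in zip(ra, rb))
print(f"max difference between the two variants: {max_diff:.2e}")
```

The windowed version does O(T·W·d) work, while the masked reference does O(T²·d) work for the same result, which is the saving the section describes.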

Problem solved

Standard self-attention scales quadratically with sequence length T: O(T²·d) compute cost and O(T²) activation memory when the score matrix is materialised. For T = 32k–128k these memory and compute requirements are impractical on a single GPU. At the same time, most tokens in typical long texts depend strongly on a few hundred neighbours rather than on all others. SWA exploits this observation: instead of modelling attention between all pairs, it restricts attention to a local window and lets information propagate further through model depth.
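To make the quadratic growth concrete, the following back-of-the-envelope sketch counts the bytes a materialised score matrix would need per head and per layer. The helper name `score_matrix_bytes` and the 2-byte (fp16) score size are assumptions for illustration; fused kernels avoid materialising either matrix, so this is only a measure of how many (query, key) pairs have to be processed.

```python
def score_matrix_bytes(T, W=None, bytes_per_score=2):
    """Bytes for one head's score matrix: [T, T] for full attention, [T, min(T, W)] for SWA."""
    keys_per_query = T if W is None else min(T, W)
    return T * keys_per_query * bytes_per_score

GIB = 1024 ** 3
for T in (32_768, 131_072):
    full = score_matrix_bytes(T)
    swa = score_matrix_bytes(T, W=4096)
    print(f"T={T}: full={full / GIB:.2f} GiB, SWA={swa / GIB:.2f} GiB, ratio={full // swa}x")
```

With W = 4096 the ratio is simply T / W: 8× at T = 32 768 and 32× at T = 131 072.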

Components

Sliding window mask

Geometric gate restricting attention to a local window

Deterministic binary mask defining which (query, key) pairs are allowed. Applied before softmax (logits outside the window = -∞).

IN: Matrix of allowed token pairs. In the causal LLM version: lower triangular, limited to W columns back.

OUT: Each query receives only W keys from the left (or W/2 from each side in the symmetric variant).

Causal SWA (LLM): Window [i-W+1, i], used in autoregressive models like Mistral, Mixtral, Gemma.

Symmetric SWA (encoder): Window [i-W/2, i+W/2], used in long-document encoders like Longformer, BigBird.

Dilated SWA: Window with gaps, giving a larger receptive field at the same W. Used occasionally.

Official
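The three window shapes listed above can be written as one small index function. This is a sketch under assumptions: `allowed_keys` is a hypothetical helper, and the dilated case is shown as a causal window with a fixed stride, which is one possible reading of "window with gaps" rather than a specific library's definition.

```python
def allowed_keys(i, T, W, variant="causal", dilation=1):
    """Return the key positions query i may attend to in a sequence of length T."""
    if variant == "causal":
        keys = range(i - W + 1, i + 1)                  # [i-W+1, i]
    elif variant == "symmetric":
        keys = range(i - W // 2, i + W // 2 + 1)        # [i-W/2, i+W/2]
    elif variant == "dilated":
        # W keys ending at i, spaced `dilation` positions apart (assumed causal form)
        keys = range(i - (W - 1) * dilation, i + 1, dilation)
    else:
        raise ValueError(f"unknown variant: {variant}")
    return [j for j in keys if 0 <= j < T]              # clip to the sequence

print(allowed_keys(5, T=10, W=3))                                  # causal
print(allowed_keys(5, T=10, W=4, variant="symmetric"))             # symmetric
print(allowed_keys(9, T=10, W=3, variant="dilated", dilation=2))   # dilated
```

Note that the symmetric window as written in the text covers W/2 keys on each side plus the query position itself, so it contains W + 1 positions when W is even.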

Rolling KV cache

Enforces SWA's effective memory savings in production inference

KV buffer holding only the last W tokens per layer. New tokens overwrite the oldest in a FIFO rotation. Crucial for keeping memory at O(W) instead of O(T).

Official
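A minimal sketch of such a buffer, assuming one cache per layer and opaque key/value objects: the token at absolute position p is written to slot p mod W, so once more than W tokens have been seen each new token overwrites the oldest one. The class name `RollingKVCache` and its methods are hypothetical and not taken from any of the listed repositories.

```python
class RollingKVCache:
    """Fixed-size KV buffer for one layer: keeps only the last `window` tokens."""

    def __init__(self, window):
        self.window = window
        self.keys = [None] * window
        self.values = [None] * window
        self.seen = 0  # number of tokens appended so far

    def append(self, key, value):
        slot = self.seen % self.window  # oldest entry is overwritten once the buffer is full
        self.keys[slot] = key
        self.values[slot] = value
        self.seen += 1

    def contents(self):
        """Return cached (position, key, value) triples in chronological order."""
        start = max(0, self.seen - self.window)
        return [(p, self.keys[p % self.window], self.values[p % self.window])
                for p in range(start, self.seen)]

cache = RollingKVCache(window=3)
for pos in range(5):
    cache.append(f"k{pos}", f"v{pos}")
print(cache.contents())
```

Storage stays at W slots regardless of how many tokens are appended, which is the O(W) behaviour described above.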

Optional: global tokens (Longformer-style)

Global "gate" for tasks requiring broader context

Selected tokens ([CLS], question tokens in QA, topic tokens) attend to the whole sequence and are visible to all other tokens. This improves global coherence at an extra O(g·T) cost, where g is the number of global tokens.

Official
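The combination of a local window with global tokens can be illustrated with an explicit boolean mask. This sketch builds the full [T, T] mask only because T is tiny here; the function name `local_plus_global_mask` is hypothetical, and a real implementation would avoid materialising the mask.

```python
def local_plus_global_mask(T, W, global_positions):
    """mask[i][j] is True when query i may attend to key j."""
    g = set(global_positions)
    half = W // 2
    return [[abs(i - j) <= half      # symmetric local window [i-W/2, i+W/2]
             or i in g               # a global token attends to every position
             or j in g               # every position attends to a global token
             for j in range(T)]
            for i in range(T)]

mask = local_plus_global_mask(T=8, W=2, global_positions=[0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each global token adds one full row and one full column to the band, which is where the O(g·T) term comes from.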

Implementation

Reference implementations

allenai/longformer (official repo)

Python (PyTorch) · Allen Institute for AI (Beltagy et al.)

Official

mistralai/mistral-inference

Python (PyTorch) · Mistral AI

Official

FlashAttention (`window_size` argument)

CUDA / Python · Tri Dao et al.

Official

vLLM — sliding window support

Python / CUDA · vLLM project

Implementation pitfalls

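Materialising the full attention matrix before masking (severity: High)

A naive SWA implementation builds the full [T, T] score matrix and then masks the positions outside the window. This cancels the memory savings: memory and compute are still O(T²). It is a common bug in simple implementations.

Fix: Use fused kernels that natively operate on [T, W] blocks, for example FlashAttention with `window_size=(W-1, 0)`. Its window bounds are documented as inclusive, so `(W-1, 0)` gives exactly the W-key window [i-W+1, i], whereas `(W, 0)` admits W+1 keys.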

Wrong KV cache size in long-context (severity: High)

In SWA only the last W tokens per layer need to be kept in the KV cache, not the entire history. Keeping the full KV cache negates SWA's main memory benefit (for Mistral 7B at T=32k versus W=4096 the difference is about 8×).

Fix: Keep only the last W tokens per layer in the KV cache (Mistral "rolling KV cache" pattern, vLLM PagedAttention with rotation).
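The 8× figure can be checked with a short calculation. The default values below (32 layers, 8 KV heads, head dimension 128, 2 bytes per element) are assumptions based on the publicly reported Mistral 7B configuration with fp16 storage, and `kv_cache_bytes` is a hypothetical helper; adjust the numbers for a different model or dtype.

```python
def kv_cache_bytes(tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size for one sequence: keys and values for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens

MIB = 1024 ** 2
full = kv_cache_bytes(32_768)    # keep the whole history at T = 32k
rolling = kv_cache_bytes(4_096)  # keep only the last W = 4096 tokens
print(f"full: {full / MIB:.0f} MiB, rolling: {rolling / MIB:.0f} MiB, ratio: {full // rolling}x")
```

Under these assumptions the cache costs 128 KiB per token, so the full history at 32k tokens takes 4 GiB per sequence while the rolling buffer stays at 512 MiB.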

Treating L·W as the hard effective context (severity: Medium)

The L·W receptive field is a theoretical upper bound, not a guarantee. The practical ability to retrieve a precise fact from a distant position (needle-in-a-haystack, NIAH) is usually significantly worse than in a full-attention model with a comparable context length.

Fix: For long-context retrieval tasks (precise lookup, NIAH) prefer SWA + full-attention hybrids (Gemma 2/3) or full-attention models with RoPE extension (YaRN/LongRoPE).
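The bound itself can be checked by simulating which input positions can reach the last token through stacked causal windows. This sketch only tracks reachability, which is why it gives an upper bound: it says nothing about how much information survives each hop. Both function names are hypothetical. With the window [i-W+1, i], each layer extends the reach by W-1 positions, so the exact count is L·(W-1)+1, slightly below L·W.

```python
def receptive_field(num_layers, window):
    """Closed form: number of input positions that can influence one top-layer output."""
    return num_layers * (window - 1) + 1

def reachable_span(T, num_layers, window):
    """Simulate propagation: how many positions can reach the last token of a length-T input."""
    # earliest[i] = earliest input position that can influence position i at the current layer
    earliest = list(range(T))
    for _ in range(num_layers):
        earliest = [min(earliest[max(0, i - window + 1): i + 1]) for i in range(T)]
    return (T - 1) - earliest[T - 1] + 1

print(reachable_span(T=20, num_layers=3, window=4), receptive_field(3, 4))
print(receptive_field(32, 4096))  # Mistral 7B-style settings from the text
```

For L=32 and W=4096 the closed form gives 131 041 positions, consistent with the approximate 131 072 figure quoted for L·W.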

Missing global tokens in long-document classification (severity: Medium)

In long-document encoders (Longformer for QA, classification), quality drops without global attention on key tokens ([CLS], question tokens): the local window is not enough to gather a global representation.

Fix: Enable global attention on the canonical task tokens (in the Hugging Face Transformers Longformer models, set `global_attention_mask` to 1 at those positions).

Evolution

Original paper · 2020 · arXiv:2004.05150 (Allen Institute for AI) · Iz Beltagy

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E. Peters, Arman Cohan

2019

Sparse Transformer (Child et al., OpenAI)

OpenAI publishes Sparse Transformer — the first widely cited work on sparsifying attention via deterministic masks (local + strided). A direct precursor of SWA.

2020

Longformer — formalisation of SWA + global tokens

Inflection point

Beltagy, Peters, Cohan (AI2) publish Longformer. They give the first full formalisation of SWA for long-document encoders and the "SWA + global attention" variant on selected tokens. The first long-context encoder competitive with BERT on short tasks.

Longformer: The Long-Document Transformer (paper)

2020

BigBird (Zaheer et al., Google)

Google publishes BigBird — sparse attention combining a local window (SWA), random attention, and global tokens. It theoretically shows that this combination retains the full expressivity of a standard Transformer.

2023

Mistral 7B — SWA in an autoregressive LLM

Inflection point

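Mistral AI releases Mistral 7B with causal SWA of W=4096 in every layer. The first widely adopted open-source LLM based entirely on SWA. It argues that the theoretical L·W receptive field (32 × 4096 ≈ 131k tokens) lets information propagate far beyond a single window, although this is an upper bound rather than a guarantee of long-range retrieval quality.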

2023

FlashAttention with sliding window

FlashAttention v2 / v3 natively support sliding window — fused SWA kernels that never materialise the full [T, T] matrix. The practical implementation standard.

FlashAttention (concept)

2024

Gemma 2 / Gemma 3 — SWA + full hybrid

Google DeepMind introduces an alternating architecture in Gemma 2/3: some layers use SWA and some use full attention. The argument: SWA cheaply provides local coherence, while full attention every few layers recovers global dependencies.
