In standard self-attention the attention matrix has shape [T, T] — every query i can take any key j in [0, T-1] (or [0, i] in the causal variant). In SWA each query i can only take keys from the window [i-W+1, i] (causal) or [i-W/2, i+W/2] (symmetric). Other positions are masked (logit = -∞ before softmax), zeroing their contribution. The effective receptive field grows linearly with depth: after L layers each token "sees" approximately L · W tokens back — for Mistral 7B (L=32, W=4096) that is 131 072 tokens of effective context even though a single layer looks at 4096. Implementation-wise SWA is fused with FlashAttention — the full [T,T] matrix is never materialised, only [T, W] matrices. The KV cache stores only the last W tokens per layer (Mistral) or the last min(T, W) (Longformer for bidirectional). This reduces KV cache memory from linear in T to linear in min(T, W).
Standard self-attention scales quadratically with sequence length T: O(T²·d) compute cost, O(T²) activation memory. For T = 32k–128k these memory and compute requirements are infeasible on a single GPU. At the same time, most tokens in typical long texts depend strongly on a few hundred neighbours rather than on all others. SWA exploits this observation: instead of modelling attention between all pairs, it restricts attention to a local window and lets information propagate further through model depth.
Deterministic binary mask defining which (query, key) pairs are allowed. Applied before softmax (logits outside the window = -∞).
Official
KV buffer holding only the last W tokens per layer. New tokens overwrite the oldest in a FIFO rotation. Crucial for keeping memory at O(W) instead of O(T).
Official
Selected tokens ([CLS], question tokens in QA, topic tokens) can attend to the whole sequence and be visible to all others. Enriches global coherence at O(g·T) cost for these tokens.
Official
A naive SWA implementation builds the full [T, T] matrix and zeros positions outside the window. It cancels all the memory savings — still O(T²). A common bug in simple implementations.
In SWA only the last W tokens per layer need to be kept in KV — not the entire history. Keeping full KV negates SWA's main memory benefit (e.g. for Mistral 7B at T=32k vs W=4096 it's ~8× difference).
The L·W receptive field is an UPPER theoretical bound, not a guarantee. Practical ability to precisely retrieve a fact from a distant position (NIAH) is usually significantly worse than in a full-attention model with comparable context length.
In long-document encoders (Longformer for QA, classification) without global attention on key tokens ([CLS], question tokens) quality drops — the local window is not enough to gather a global representation.
OpenAI publishes Sparse Transformer — the first widely cited work on sparsifying attention via deterministic masks (local + strided). A direct precursor of SWA.
Beltagy, Peters, Cohan (AI2) publish Longformer. They give the first full formalisation of SWA for long-document encoders and the "SWA + global attention" variant on selected tokens. The first long-context encoder competitive with BERT on short tasks.
Google publishes BigBird — sparse attention combining a local window (SWA), random attention, and global tokens. It theoretically shows that this combination retains the full expressivity of a standard Transformer.
Mistral AI releases Mistral 7B with causal SWA of W=4096 in every layer. The first widely adopted open-source LLM based entirely on SWA. It shows that the L·W receptive field (32 × 4096 = 131k) is sufficient for high-quality long-context.
FlashAttention v2 / v3 natively support sliding window — fused SWA kernels that never materialise the full [T, T] matrix. The practical implementation standard.
Google DeepMind introduces in Gemma 2/3 an alternating architecture: some layers are SWA, some are full attention. The argument: SWA cheaply provides local coherence while full attention every few layers recovers global dependencies.
Time complexity: O(T · W · d) per warstwa (zamiast O(T² · d) w full attention). Space complexity: O(W · d) per warstwa (KV cache) + O(T · W · d) aktywacji w fuzowanej impl..
The main SWA limitation is the linear growth of receptive field with depth. To make the model "see" T tokens it needs at least ceil(T/W) layers. For very long contexts (T = 1M+) model depth itself becomes the limit — hence the hybrid variant with full attention.
Number of tokens visible to each query. The most important SWA parameter — a trade-off between effective context (L·W) and cost (O(T·W·d)).
In autoregressive models (LLMs) the window is causal ([i-W+1, i]). In encoders (Longformer, BERT-like) the window is symmetric ([i-W/2, i+W/2]).
Whether all layers use SWA (Mistral) or SWA is interleaved with full-attention layers (Gemma 2/3). The interleaved variant recovers some global coherence at the cost of extra memory/compute.
Selected tokens (e.g. [CLS], the question in QA) can attend to the whole sequence and be visible to all others — the Longformer "SWA + global" variant. Enriches global coherence.
A "dilated SWA" variant — window with gaps (like dilated convolution). Lets you grow the receptive field without growing W. Used occasionally, mainly in encoders.
SWA is sparse attention with deterministic geometric sparsity (not learned structured sparsity like Mixture-of-Experts). Inside the window attention is dense, outside zero. Secondary "dense" mode is listed because many models (Gemma 2/3, Mistral) interleave SWA layers with full-attention layers.
No learned routing — sparsification is static (a geometric window mask).
SWA only modifies the attention mask — remains fully parallel within a layer. Fused implementations (FlashAttention with sliding window) operate on [T, W] blocks instead of [T, T], which additionally improves GPU cache utilisation.
SWA is designed for fused GPU kernels (FlashAttention with `window_size`). Local [T, W] blocks fit perfectly in tensor core SRAM, yielding significant speedup over dense attention.
The idea itself (deterministic window mask) works on any hardware supporting standard attention. The actual memory benefit, however, depends on a fused kernel.
llama.cpp implements SWA on CPU AVX — works, although production use is much smaller than on GPU.