Robots Atlas

MSA (Memory Sparse Attention)

MSA (Memory Sparse Attention) introduces an end-to-end differentiable sparse latent memory layer embedded directly within the Transformer attention mechanism, achieving near-linear O(L) complexity while scaling to 100 million context tokens without an external retrieval system.

Category
Abstraction level
Operation level
01

Memory Sparse Attention Layer

Replaces the standard full-attention mechanism in the upper Transformer layers; for each query it selects the top-k documents from a compressed memory bank and appends their K/V pairs to the local context.

Modular

The core attention layer that replaces full attention in upper transformer layers. For each query, a routing projector computes cosine similarity against all stored routing keys (Kᵣ), selects the top-k most relevant document blocks, and concatenates their compressed K/V with the local short-context K/V for standard autoregressive decoding. Lower layers retain independent per-document attention for hierarchical alignment.
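
A minimal sketch of how such a decoding step could look in PyTorch is given below. The function name `route_and_attend`, the explicit router projection matrix, and the `fetch_kv` callable are illustrative assumptions, not the official implementation.

```python
# Minimal sketch of one MSA-style decoding step in an upper layer (illustrative).
import torch
import torch.nn.functional as F

def route_and_attend(q, local_k, local_v, routing_keys, router_proj, fetch_kv, top_k=4):
    """q: (d,) current query vector.
    local_k, local_v: (n_local, d) short-context K/V.
    routing_keys: (L, d_r) compressed per-document routing keys (GPU VRAM).
    router_proj: (d, d_r) routing projection.
    fetch_kv: callable doc_ids -> ((n_mem, d), (n_mem, d)) compressed K/V."""
    # 1) Score every stored document by cosine similarity in routing space.
    q_r = F.normalize(q @ router_proj, dim=-1)                 # (d_r,)
    scores = F.normalize(routing_keys, dim=-1) @ q_r           # (L,)
    doc_ids = scores.topk(top_k).indices

    # 2) Fetch the selected documents' compressed K/V and prepend them to the local context.
    mem_k, mem_v = fetch_kv(doc_ids)
    k = torch.cat([mem_k, local_k], dim=0)
    v = torch.cat([mem_v, local_v], dim=0)

    # 3) Standard scaled dot-product attention over [retrieved memory ; local context].
    attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```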

02

Router (Routing Projector)

Routing key projection Kᵣ — a compressed document representation used to select top-k documents based on cosine similarity with the current query.

A lightweight projector that maps each document's token-level keys to a compressed routing key Kᵣ (via chunked mean pooling). At inference time, cosine similarity between the query vector and all stored Kᵣ vectors is computed to select the top-k most relevant document blocks. Stored in GPU VRAM for fast scoring.
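
A minimal sketch of chunked mean pooling for a single document; the chunk size and the final pooling step are illustrative choices, not the exact MSA recipe.

```python
# Illustrative sketch: derive a routing key K_r from token-level keys
# via chunked mean pooling (chunk size is an assumption).
import torch

def build_routing_key(doc_keys: torch.Tensor, chunk_size: int = 64) -> torch.Tensor:
    """doc_keys: (n_tokens, d) token-level keys for one document.
    Returns a single compressed routing key for the document."""
    # Average within fixed-size chunks, then pool the chunk means.
    chunk_means = torch.stack([c.mean(dim=0) for c in doc_keys.split(chunk_size, dim=0)])
    return chunk_means.mean(dim=0)
```

At serving time, the resulting per-document routing keys are stacked into the (L, d_r) matrix that stays resident in GPU VRAM for scoring; if a learned projection down to d_r is used, it would follow the pooling step.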

03

Document-wise RoPE

Positioning mechanism that resets the position counter to zero at the start of each document, enabling positional extrapolation from short training contexts to 100M inference tokens.

A modified Rotary Position Embedding (RoPE) scheme where positional indices reset to zero at each document boundary (Parallel RoPE). The active query context uses Global RoPE with an offset of k (number of retrieved blocks) to maintain causal order. This decouples positional encoding from global sequence length, enabling zero-shot extrapolation from short training contexts (e.g., 64K tokens) to 100M-token inference without additional training.

Parallel (Document-level) RoPE · Global RoPE
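
A sketch of how position indices could be laid out under this scheme; the function and the exact offset convention are assumptions for illustration.

```python
# Sketch of position-index assignment under Document-wise RoPE (assumed layout):
# memory documents each restart at 0; the active query context uses global
# positions offset by k, the number of retrieved blocks.
import torch

def document_wise_positions(doc_lengths, query_len, k):
    """doc_lengths: list of token counts for the retrieved documents.
    Returns (memory_positions, query_positions) to feed into RoPE."""
    # Parallel RoPE: every document block restarts its position counter at zero.
    memory_positions = torch.cat([torch.arange(n) for n in doc_lengths])
    # Global RoPE for the live context, offset by k to preserve causal order.
    query_positions = torch.arange(query_len) + k
    return memory_positions, query_positions

# Example: three retrieved documents and an 8-token active context with k=3.
mem_pos, q_pos = document_wise_positions([5, 4, 6], query_len=8, k=3)
```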
04

KV Cache Memory Store

Hierarchical storage of compressed document latent states: routing keys Kᵣ in GPU VRAM (fast access for scoring), full K/V tensors in CPU RAM (memory efficiency).

A hierarchical key-value memory store that holds compressed document representations. Routing keys (Kᵣ) reside in GPU VRAM for fast similarity scoring. Full K/V tensors are stored in CPU RAM and transferred on-demand for the top-k selected documents. This hierarchical layout enables 100M-token throughput on 2×A800 GPUs using the Memory Parallel inference engine.
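
A hedged sketch of this layout in PyTorch; the class name, the pinned-memory choice, and the fetch interface are assumptions, not the official Memory Parallel engine.

```python
# Illustrative hierarchical memory layout: routing keys resident on the GPU,
# full compressed K/V kept in pinned CPU RAM and copied over on demand.
import torch

class HierarchicalKVStore:
    def __init__(self, routing_keys, kv_blocks, device="cuda"):
        # routing_keys: (L, d_r), kept in GPU VRAM for fast similarity scoring.
        self.routing_keys = routing_keys.to(device)
        # kv_blocks: list of (K, V) CPU tensors per document, pinned for async copies.
        self.kv_blocks = [(k.pin_memory(), v.pin_memory()) for k, v in kv_blocks]
        self.device = device

    def fetch(self, doc_ids):
        """Transfer the top-k selected documents' K/V to the GPU on demand."""
        ks, vs = [], []
        for i in doc_ids.tolist():
            k_cpu, v_cpu = self.kv_blocks[i]
            ks.append(k_cpu.to(self.device, non_blocking=True))
            vs.append(v_cpu.to(self.device, non_blocking=True))
        return torch.cat(ks, dim=0), torch.cat(vs, dim=0)
```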

05

Memory Interleave

Multi-step reasoning mechanism that iteratively interleaves document identifier generation and context expansion, enabling multi-hop retrieval across distributed memory chunks.

Modular

A reasoning mechanism that adaptively interleaves three modes: generative retrieval (model generates document IDs), context expansion (retrieved content is appended), and generation (final answer synthesis). This enables multi-hop reasoning across scattered memory fragments that single-round retrieval cannot handle.
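
A heavily simplified control-flow sketch of this interleaving; the "<retrieve>" sentinel, the callables, and the hop limit are illustrative assumptions.

```python
# Illustrative control flow for Memory Interleave: alternate between generative
# retrieval (doc IDs), context expansion (retrieved content), and generation.
def memory_interleave(generate, expand, prompt, max_hops=4):
    """generate: callable str -> str (model step: emits doc IDs or answer text).
    expand: callable list[int] -> str (returns retrieved document content)."""
    context = prompt
    for _ in range(max_hops):
        step = generate(context)
        if step.startswith("<retrieve>"):
            # Generative retrieval: the model emits identifiers of memory blocks.
            doc_ids = [int(t) for t in step.removeprefix("<retrieve>").split()]
            # Context expansion: append the retrieved content and keep reasoning.
            context += expand(doc_ids)
        else:
            # Generation: the model produced the final answer.
            return step
    return generate(context)
```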

Time

L = total number of documents in the memory bank; k = number of documents selected per query (a small constant); d = hidden dimension. Router scoring is O(L · d_r), where d_r is the compressed routing key dimension. Overall complexity is near-linear O(L) because the top-k selection size is constant.

Achieves near-linear O(L) complexity compared to O(L²) for full attention. Training complexity is O(n · d) per document block where n is the local context window size. KV cache compression reduces per-document memory footprint.
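
One way to make the per-decoding-step cost explicit, writing n_c for the (assumed constant) number of compressed K/V slots contributed by each retrieved block and n_local for the local window:

```latex
\underbrace{O(L \cdot d_r)}_{\text{router scoring}}
\;+\;
\underbrace{O\big((k \cdot n_c + n_{\text{local}}) \cdot d\big)}_{\text{attention over selected K/V}}
```

Because k, n_c, n_local, d_r, and d are constants independent of the memory bank, only the first term grows with L, which gives the near-linear behaviour; full attention over the same material would instead grow with the square of the total token count.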

Memory complexity

L = number of documents; d_r = compressed routing key dimension (small); n = per-document token count; d = K/V head dimension. Full K/V tensors stored in CPU RAM; only routing keys in GPU VRAM.

Hierarchical storage design separates fast-access routing keys (GPU VRAM) from bulk K/V tensors (CPU RAM), enabling 100M-token memory banks on 2×A800 GPUs.
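
A back-of-envelope footprint estimate for this split; every number below (chunk size, routing dimension, K/V width, fp16 storage) is an illustrative assumption, not a reported configuration.

```python
# Rough footprint estimate for a 100M-token memory bank (all values assumed).
tokens     = 100_000_000
chunk      = 64          # tokens compressed into one latent slot
d_r        = 128         # routing key dimension (stored in GPU VRAM)
d_kv       = 1024        # per-slot width for each of K and V (stored in CPU RAM)
bytes_fp16 = 2

slots     = tokens // chunk
gpu_bytes = slots * d_r * bytes_fp16        # routing keys on GPU
cpu_bytes = slots * d_kv * 2 * bytes_fp16   # K and V on CPU

print(f"routing keys (GPU): {gpu_bytes / 2**30:.1f} GiB")
print(f"compressed K/V (CPU): {cpu_bytes / 2**30:.1f} GiB")
```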

Bottleneck: Router scoring over the entire memory bank

Computing cosine similarity between the query and all stored routing keys Kᵣ scales linearly with the number of documents L in the memory bank. This is the dominant inference bottleneck at very large memory bank sizes (100M tokens).

Parallelism

Partially parallel

Training is parallelizable across documents (each document processed independently in lower layers). Inference uses Memory Parallel engine for distributed router scoring across devices. Top-k selection and subsequent K/V retrieval are sequential within each decoding step.
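
A sketch of sharded router scoring under these assumptions: each shard is scored independently, local top-k candidates are merged, and a global top-k is taken over the merged candidates. This is not the official Memory Parallel engine.

```python
# Illustrative sharded router scoring across devices/shards.
import torch
import torch.nn.functional as F

def sharded_topk(q_r, routing_key_shards, k=4):
    """q_r: (d_r,) already-normalized routing query.
    routing_key_shards: list of (L_i, d_r) tensors, one per device/shard.
    Returns (scores, global_doc_indices) for the global top-k documents."""
    cand_scores, cand_ids, offset = [], [], 0
    for shard in routing_key_shards:
        s = F.normalize(shard, dim=-1) @ q_r.to(shard.device)   # local cosine scores
        top = s.topk(min(k, s.numel()))
        cand_scores.append(top.values.cpu())
        cand_ids.append(top.indices.cpu() + offset)
        offset += shard.shape[0]
    scores, ids = torch.cat(cand_scores), torch.cat(cand_ids)
    top = scores.topk(min(k, scores.numel()))
    return top.values, ids[top.indices]
```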

Paradigm

Sparse

Top-K selected

Only the top-k selected document K/V blocks contribute to attention computation. Routing is applied in upper layers only; lower layers process each document independently (dense, per-document).

Top-k documents

Critical

Number of document blocks retrieved per query step. Controls the precision-cost tradeoff: higher k improves recall at the cost of more K/V transfers and attention computation.

Training context length

Standard
  • 4K tokens: Minimum practical training context
  • 64K tokens: Training context used for the MSA-4B model

The per-document token window size used during training. Due to Document-wise RoPE, models trained on short contexts (e.g., 4K–64K tokens) extrapolate to 100M+ token memory banks at inference without retraining.

Layers with routing

Standard

Routing (MSA layer) is applied only in upper transformer layers; lower layers process documents independently. The split between local and memory-routing layers affects both memory capacity and reasoning depth.

Memory bank size

Standard
  • 1M tokens: Typical upper bound on context length in modern long-context LLMs
  • 100M tokens: Maximum validated in the MSA paper on 2×A800 GPUs

Total number of tokens stored in the long-term memory bank. MSA has been validated up to 100M tokens on 2×A800 GPUs.

Common pitfalls

Cold Start — Building a Memory Bank Before Inference
MEDIUM

MSA requires pre-encoding all documents into compressed latent states (Kᵣ, K, V) before inference can begin. For very large memory banks (100M tokens), this pre-encoding phase requires significant compute and storage management.

Pre-encode documents offline and store the resulting routing keys and K/V tensors. Apply chunked mean pooling during this pass so that stored representations and later transfers stay small. Ensure adequate CPU RAM and GPU VRAM for hierarchical storage before serving.
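
A sketch of what such an offline pass might look like; `encode_doc`, the chunk size, and the `torch.save` layout are assumptions.

```python
# Illustrative offline pre-encoding pass: encode every document once, derive its
# routing key and compressed K/V, and persist them before serving.
import torch

def preencode_corpus(docs, encode_doc, chunk_size=64, out_path="memory_bank.pt"):
    """docs: iterable of token-id tensors.
    encode_doc: callable returning (keys, values), each of shape (n_tokens, d)."""
    routing_keys, kv_blocks = [], []
    for doc in docs:
        k, v = encode_doc(doc)
        # Chunked mean pooling compresses both the routing key and the stored K/V.
        k_c = torch.stack([c.mean(dim=0) for c in k.split(chunk_size, dim=0)])
        v_c = torch.stack([c.mean(dim=0) for c in v.split(chunk_size, dim=0)])
        routing_keys.append(k_c.mean(dim=0))
        kv_blocks.append((k_c, v_c))
    torch.save({"routing_keys": torch.stack(routing_keys), "kv": kv_blocks}, out_path)
```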

Incorrect separation of routing and local layers
MEDIUM

MSA applies routing only in upper transformer layers, while lower layers maintain independent per-document processing. Applying routing to too few or too many layers affects the balance between local reasoning and memory retrieval quality.

Follow the layer split configuration specified in the paper and official implementation. Ablate on a held-out long-context validation set when adapting to new backbone architectures.

RoPE inconsistency between training and inference modes
HIGH

Document-wise RoPE requires that positional indices reset at each document boundary during both training and inference. Failure to apply this scheme consistently — or mixing with standard global RoPE — causes positional drift and severe performance degradation at long contexts.

Apply Parallel RoPE (document-level reset) to all memory-bank documents and Global RoPE with k-offset to the active query context, exactly as specified in the paper. Verify implementation against the official codebase.

CPU-GPU bandwidth bottleneck with large memory banks
MEDIUM

On-demand transfer of top-k document K/V tensors from CPU RAM to GPU VRAM during each decoding step can become a latency bottleneck when k is large or K/V tensors are not compressed aggressively enough.

Keep k small (as validated in the paper). Apply aggressive K/V compression (chunked mean pooling) to reduce transfer volume. Use pinned CPU memory and asynchronous prefetching where possible.
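
A sketch of asynchronous prefetching with a dedicated CUDA stream, assuming pinned CPU tensors (as in the store sketch above) and a CUDA-capable device.

```python
# Illustrative async prefetch of the selected documents' K/V on a copy stream.
import torch

def prefetch_kv(kv_cpu_blocks, doc_ids, copy_stream, device="cuda"):
    """Launch asynchronous host-to-device copies for the selected documents.
    The returned GPU tensors become valid once the compute stream waits on
    copy_stream (torch.cuda.current_stream().wait_stream(copy_stream))."""
    out = []
    with torch.cuda.stream(copy_stream):
        for i in doc_ids:
            k_cpu, v_cpu = kv_cpu_blocks[i]
            out.append((k_cpu.to(device, non_blocking=True),
                        v_cpu.to(device, non_blocking=True)))
    return out

# Usage sketch:
# copy_stream = torch.cuda.Stream()
# pending = prefetch_kv(store_blocks, [3, 17, 42], copy_stream)
# torch.cuda.current_stream().wait_stream(copy_stream)  # sync before attention
```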

GENESIS · Source paper

MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
2026 · NeurIPS 2026 · Yu Chen, Runkai Chen, Sheng Yi et al.
2026

MSA paper published on arXiv and at NeurIPS 2026 (EverMind)

breakthrough

Yu Chen et al. (EverMind / Shanda Group) submitted 'MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens' to arXiv on March 6, 2026 (arXiv:2603.23516), accepted to NeurIPS 2026. The paper introduces the MSA architecture and demonstrates less than 9% performance degradation scaling from 16K to 100M tokens.

2026

Open-source release of MSA code and MSA-4B model on GitHub and Hugging Face

EverMind open-sourced the MSA codebase (github.com/EverMind-AI/MSA) and released the MSA-4B model checkpoint (based on Qwen3-4B-Instruct-2507) on Hugging Face (EverMind-AI/MSA-4B). The repository accumulated over 2,500 GitHub stars within one day of release.

GPU Tensor Cores · PRIMARY

MSA requires GPU Tensor Cores for efficient transformer attention computation and router scoring. The validated configuration uses 2×A800 GPUs for 100M-token inference, with routing keys stored in GPU VRAM and full K/V in CPU RAM.

Minimum validated hardware: 2×NVIDIA A800 GPUs (80GB each) for 100M-token inference. The Memory Parallel inference engine is designed for multi-GPU distributed scoring.

Main creators