Memory Sparse Attention Layer
Replaces the standard full-attention mechanism in the upper Transformer layers. For each query, a routing projector computes cosine similarity against all stored routing keys (Kᵣ), selects the top-k most relevant document blocks from the compressed memory bank, and concatenates their compressed K/V pairs with the local short-context K/V for standard autoregressive decoding. Lower layers retain independent per-document attention to preserve hierarchical alignment.
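A minimal PyTorch sketch of the routing and concatenation step described above. The tensor names, shapes, the pooled top-k selection, and the mask convention (memory tokens fully visible, causal mask over the local span) are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def memory_sparse_attention(q, local_k, local_v, route_keys, mem_k, mem_v, k_top=4):
    """Sketch of routing + K/V concatenation (shapes are assumptions).

    q:          (T, d)     queries for the local context
    local_k/v:  (T, d)     short-context keys/values
    route_keys: (N, d)     one routing key K_r per compressed document block
    mem_k/v:    (N, B, d)  compressed K/V, B entries per document block
    """
    d = q.shape[-1]
    # Route: cosine similarity between each query and every routing key.
    sim = F.normalize(q, dim=-1) @ F.normalize(route_keys, dim=-1).T  # (T, N)
    # Select top-k document blocks, pooled over the query sequence for
    # simplicity; a per-query gather is an equally plausible variant.
    top = sim.mean(dim=0).topk(k_top).indices                         # (k_top,)

    # Concatenate retrieved compressed K/V with the local short-context K/V.
    sel_k = mem_k[top].reshape(-1, d)                                 # (k_top*B, d)
    sel_v = mem_v[top].reshape(-1, d)
    k = torch.cat([sel_k, local_k], dim=0)
    v = torch.cat([sel_v, local_v], dim=0)

    # Standard scaled dot-product attention over the extended context:
    # causal mask on local positions, retrieved memory tokens fully visible.
    T, M = q.shape[0], sel_k.shape[0]
    mask = torch.cat(
        [torch.ones(T, M, dtype=torch.bool),
         torch.tril(torch.ones(T, T, dtype=torch.bool))], dim=1)
    scores = (q @ k.T) / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Pooling the routing scores over the query sequence keeps a single shared set of retrieved blocks per decoding step, which keeps the K/V concatenation uniform; per-query selection trades that simplicity for finer-grained retrieval.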