MiniMax M3: sparse attention architecture and 15.6× faster decoding

MiniMax, the Chinese AI model company known for its M2 series, published a detailed technical report on its M2 models on May 27, 2026, and simultaneously teased its next generation: the M3 model with a new MiniMax Sparse Attention (MSA) mechanism. According to the company's early hardware profiling, MSA delivers a 15.6× speedup in the decoding phase at one-million-token context lengths compared to the full attention used in M2.

Key takeaways

MiniMax M3 introduces "MiniMax Sparse Attention" (MSA) — block-level selection on real, uncompressed KV pairs
Decoding speedup at 1M tokens: 15.6×; prefilling speedup: 9.7× — both versus M2's full attention
M2 has 229.9 billion total parameters and activates 9.8 billion per token via 256 fine-grained experts (MoE)
M2.7 handled 30–50% of its own ML development workflow; scored 66.6% medal rate on MLE Bench Lite
MiniMax claims MSA solves the core weakness of sub-quadratic methods: degraded multi-hop reasoning

Trapped in the quadratic

Every large language model faces the same wall: standard full attention requires every token to compute its relationship with every other token. Cost grows quadratically with sequence length — at one million tokens, that is computationally prohibitive.
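To make the quadratic growth concrete, the short sketch below counts query-key score computations for a single attention head in a single layer. It is plain arithmetic, not a measurement of any MiniMax model, and the function name `attention_pairs` is made up for this illustration.

```python
def attention_pairs(n_tokens: int, causal: bool = True) -> int:
    """Query-key score computations for one head in one layer.

    With a causal mask, token i attends to tokens 0..i, which gives
    n * (n + 1) / 2 pairs; without a mask it is n * n.
    """
    if causal:
        return n_tokens * (n_tokens + 1) // 2
    return n_tokens * n_tokens


for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):,} causal pairs")
```

Going from 128,000 to 1,000,000 tokens makes the sequence about 7.8 times longer but multiplies the pair count by roughly 61, and that work is repeated for every head in every layer.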

Sub-quadratic alternatives such as sliding window attention (SWA) and linear attention reduce cost but have historically degraded multi-hop reasoning. MiniMax's own M2 test data, shown in Table 1, is specific: on the RULER 128K benchmark, the sliding-window variant scored 72.0 against 90.0 for full attention. That 18-point gap was decisive. The SWA variants were dropped, and M2 shipped with full attention, absorbing the full compute cost.

M2 attention variant	RULER 128K score	Compute cost
Full attention (shipped to production)	90.0	Quadratic in context length
Sliding Window Attention (>32K)	72.0	Sub-quadratic
Gap	−18 pts	—
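One intuition for why a sliding window can hurt long-range reasoning is that each token sees only its most recent neighbors directly. The sketch below builds a generic causal sliding-window mask with NumPy. It illustrates the general technique only: the exact SWA variant MiniMax tested is not described here, and `seq_len` and `window` are illustrative values.

```python
import numpy as np


def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if query i may attend to key j.

    Query i sees itself and the (window - 1) tokens before it.
    """
    i = np.arange(seq_len)[:, None]  # query positions as a column
    j = np.arange(seq_len)[None, :]  # key positions as a row
    return (j <= i) & (j > i - window)


mask = sliding_window_mask(seq_len=8, window=3)
print(mask.astype(int))
```

In this toy case the last token can attend to positions 5, 6 and 7 but not to position 0, so anything it needs from the start of the sequence has to be relayed through intermediate tokens across layers.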

MSA: block-level selection on real KV

The upcoming M3 architecture is meant to break that tradeoff. MiniMax Sparse Attention (MSA) does not compress keys and values into a low-dimensional latent space, as DeepSeek's Multi-head Latent Attention (MLA) does. Instead, it operates on a standard grouped-query attention (GQA) backbone and dynamically selects blocks of real, uncompressed KV pairs to attend to.

This distinction matters for two reasons. First, it avoids the precision loss that compression can introduce. Second, it enables native prefix caching. The lack of prefix caching held back earlier sub-quadratic methods and complicated their integration with Multi-Token Prediction (MTP) for speculative decoding.
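MiniMax has not yet published how MSA scores or selects blocks, so the following is only a generic sketch of block-level selection over uncompressed keys and values, for one head and one decode step. The scoring rule (query dotted with each block's mean key), the function name, and the `block_size` and `top_k` parameters are all assumptions made for illustration, not the MSA design.

```python
import numpy as np


def block_sparse_decode_step(q, K, V, block_size=4, top_k=2):
    """Attend only to the top_k highest-scoring KV blocks (one head, one step).

    q: (d,); K and V: (n, d), with n divisible by block_size for simplicity.
    """
    n, d = K.shape
    n_blocks = n // block_size
    K_blocks = K.reshape(n_blocks, block_size, d)
    V_blocks = V.reshape(n_blocks, block_size, d)

    # Hypothetical scoring rule: query dotted with the mean key of each block.
    block_scores = K_blocks.mean(axis=1) @ q
    chosen = np.sort(np.argsort(block_scores)[-top_k:])  # keep sequence order

    # Gather the real, uncompressed keys and values of the chosen blocks.
    K_sel = K_blocks[chosen].reshape(-1, d)
    V_sel = V_blocks[chosen].reshape(-1, d)

    # Standard scaled dot-product attention over the selected entries only.
    logits = K_sel @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V_sel, chosen


rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 8))
q = rng.normal(size=8)
out, chosen = block_sparse_decode_step(q, K, V)
print(chosen, out.shape)
```

Because the cache holds ordinary keys and values, setting `top_k` to the total number of blocks reduces this to plain full attention. That property is what makes this kind of scheme compatible with an existing GQA cache layout in principle.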

Property	M2 (full attention)	M3 (MiniMax Sparse Attention)
Attention backbone	Full attention (GQA)	GQA + block-level KV selection
KV cache	Uncompressed	Uncompressed (vs. DeepSeek MLA)
Total parameters	229.9B	Not disclosed
Active parameters / token	9.8B (256 MoE experts)	Not disclosed
Prefilling at 1M tokens	Baseline	9.7× faster
Decoding at 1M tokens	Baseline	15.6× faster
Prefix caching	Yes	Yes (native)
Multi-hop reasoning	Preserved	Preserved (per MiniMax claim)

Table 2 compares both generations. The decoding row is the headline: a 15.6× speedup at one million tokens. Decoding is typically the bottleneck in long text generation, because every new token must attend to the cached keys and values of all prior tokens, so the per-token cost grows with context length. A speedup of this magnitude could make long agentic outputs, such as multi-step task results and multi-page summaries, economically viable to generate in real time.
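A toy cost model shows why decoding at long context is so sensitive to how many cached entries each step touches. The sketch below counts KV entries read while generating tokens after a long prompt, with and without a fixed per-step budget. The `budget` value is an arbitrary assumption, and the resulting ratio is an artifact of that choice, not an attempt to reproduce MiniMax's measured 15.6×.

```python
def keys_read_during_decode(prompt_len, new_tokens, budget=None):
    """Total cached KV entries attended to while generating new_tokens tokens.

    budget=None models full attention. Otherwise each step reads at most
    `budget` entries (a hypothetical fixed selection budget).
    """
    total = 0
    for t in range(new_tokens):
        context = prompt_len + t  # entries already in the cache at this step
        total += context if budget is None else min(context, budget)
    return total


full = keys_read_during_decode(1_000_000, 1_000)
sparse = keys_read_during_decode(1_000_000, 1_000, budget=50_000)
print(full, sparse, round(full / sparse, 1))
```

Real speedups also depend on selection overhead, memory bandwidth, and kernel efficiency, none of which this count captures.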

Forge and the self-improving M2.7

The M2 report also reveals the architecture of MiniMax's agent training system. The company built "Forge", a reinforcement learning (RL) environment split into three independent modules: the Agent Side, a middleware abstraction layer (Gateway Server and Data Pool), and the Training/Inference engines.

Two engineering solutions inside Forge stand out: windowed FIFO scheduling (a sliding-window scheduler that prevents cluster idle time and gradient oscillation) and prefix tree merging (grouping completions that share identical prefixes into a single forward pass, yielding up to a 40× training speedup with zero approximation error).
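The idea behind prefix tree merging can be illustrated by counting token positions. Under causal attention, a position's hidden state depends only on the tokens before it, so positions that share an identical prefix can be computed once without any approximation. The sketch below is not Forge's implementation: it only counts unique positions with a dictionary-based trie, and the token ids are made up.

```python
def count_trie_tokens(sequences):
    """Count unique (prefix, token) positions across sequences using a trie."""
    root = {}
    unique = 0
    for seq in sequences:
        node = root
        for tok in seq:
            if tok not in node:
                node[tok] = {}  # first time this token follows this prefix
                unique += 1
            node = node[tok]
    return unique


prompt = [101, 7, 7, 42, 9]  # shared prefix (hypothetical token ids)
completions = [prompt + [1, 2], prompt + [1, 3], prompt + [5]]

naive = sum(len(s) for s in completions)  # positions if processed separately
merged = count_trie_tokens(completions)   # positions after merging prefixes
print(naive, merged)
```

The saving grows with the length of the shared prompt and the number of sampled completions per prompt, which is the regime RL rollouts tend to operate in. A real implementation would also need a tree-structured attention mask so that each branch attends only to its own ancestors.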

The result of Forge training is MiniMax M2.7, which MiniMax says autonomously handles 30–50% of its own ML development pipeline. On MLE Bench Lite, a benchmark that tests autonomous machine learning research capability, M2.7 scored a 66.6% medal rate across 24-hour independent trials, effectively matching Google's closed-weight Gemini 3.1 Pro. In the open-source market, MiniMax competes with Xiaomi for dominance in the agentic model segment.

Why this matters

MiniMax M3 matters for reasons that go beyond a single benchmark. If MSA genuinely delivers a 15.6× decoding speedup without degrading multi-hop reasoning, it breaks a tradeoff that has constrained agentic applications at long contexts for years. The cost of inference at one million tokens would drop dramatically, making agentic infrastructure viable for enterprises that currently cannot afford full attention at long sequences.

The M2 report is also a rare example of technical transparency from a Chinese AI vendor: it documents not only successes but also dead ends, including rejected sub-quadratic architectures and expert load-balancing problems. For AI engineers building their own models, it is a free roadmap for avoiding costly mistakes. Competitively, a successful M3 would strengthen MiniMax's position in the open-source agentic model segment.

What's next

MiniMax announced a technical blog post detailing MSA — its publication will be the first real test of the claimed speedup numbers
The M2 technical report is already available on Hugging Face — developers can independently verify the described Forge and MTP results
Full M3 launch has no announced date; the company teased "Something BIG is coming" — a concrete timeline will show whether the schedule matches the architectural ambitions