MiniMax — the Chinese AI model company known for its M2 series — published a detailed technical report on May 27, 2026, and simultaneously teased its next generation: the M3 model with a new MiniMax Sparse Attention (MSA) mechanism. Early hardware profiling shows MSA delivers a 15.6× speedup in the decoding phase at one-million-token context lengths compared to the full attention used in M2.
Key takeaways
- MiniMax M3 introduces "MiniMax Sparse Attention" (MSA) — block-level selection on real, uncompressed KV pairs
- Decoding speedup at 1M tokens: 15.6×; prefilling speedup: 9.7× — both versus M2's full attention
- M2 has 229.9 billion total parameters and activates 9.8 billion per token across 256 fine-grained experts (MoE)
- M2.7 handled 30–50% of its own ML development workflow; scored 66.6% medal rate on MLE Bench Lite
- MiniMax claims MSA solves the core weakness of sub-quadratic methods: degraded multi-hop reasoning
Trapped in the quadratic
Every large language model faces the same wall: standard full attention requires every token to compute its relationship with every other token. Cost grows quadratically with sequence length — at one million tokens, that is computationally prohibitive.
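To make the quadratic growth concrete, the short Python sketch below counts the query-key scores a single attention head computes under causal full attention and under a sliding window. The function names and any window size passed in are illustrative assumptions, not figures from the MiniMax report.

```python
def attention_score_count(seq_len: int, causal: bool = True) -> int:
    """Number of query-key scores one attention head computes over a sequence."""
    if causal:
        # Token i attends to tokens 0..i, so the total is 1 + 2 + ... + n.
        return seq_len * (seq_len + 1) // 2
    return seq_len * seq_len


def sliding_window_score_count(seq_len: int, window: int) -> int:
    """Same count when each token only sees the most recent `window` tokens."""
    if seq_len <= window:
        return attention_score_count(seq_len)
    # The first `window` tokens form a causal triangle; every later token
    # attends to exactly `window` positions.
    return window * (window + 1) // 2 + (seq_len - window) * window
```

Under these definitions, `attention_score_count(1_000_000)` is 500,000,500,000 scores per head per layer, and a tenfold increase in length costs roughly a hundredfold more scores. The sliding-window count grows only linearly once the sequence is longer than the window, which is where the savings of sub-quadratic methods come from.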
Sub-quadratic alternatives such as sliding window attention (SWA) and linear attention reduce cost but have historically degraded multi-hop reasoning. MiniMax's M2 testing data, shown in Table 1, is specific: on the RULER 128K benchmark, the sliding-window variant scored 18 points below full attention. That gap was decisive. The SWA variants were dropped, and M2 shipped with full attention, absorbing the full compute cost.
| M2 attention variant | RULER 128K score | Compute cost |
|---|---|---|
| Full attention (shipped to production) | 90.0 | Quadratic in context length |
| Sliding Window Attention (>32K) | 72.0 | Sub-quadratic |
| Gap | −18 pts | — |
MSA: block-level selection on real KV
The upcoming M3 architecture is designed to break that tradeoff. MiniMax Sparse Attention (MSA) does not compress keys and values into a low-dimensional latent space, as DeepSeek's MLA does. Instead, it operates on a standard GQA backbone and dynamically selects blocks of real, uncompressed KV pairs.
This distinction matters for two reasons. First, it eliminates precision loss from compression. Second, it enables native prefix caching. The absence of that feature held back earlier sub-quadratic methods and complicated their integration with Multi-Token Prediction (MTP) for speculative decoding.
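The general idea of block-level selection over uncompressed KV pairs can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MiniMax's implementation: the source does not describe how MSA scores or picks blocks, so the mean-key scoring rule, the function name `block_sparse_decode_step`, and the `block_size` and `top_k` parameters are all hypothetical. The sketch handles one query vector for one head and ignores GQA grouping.

```python
import math


def block_sparse_decode_step(query, keys, values, block_size, top_k):
    """Attend from one query to the top_k highest-scoring blocks of the KV cache.

    Assumption: a block is ranked by the dot product of the query with the
    mean of its keys. The source does not say how MSA ranks blocks.
    """
    n = len(keys)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Rank each block using a cheap summary (the mean key of the block).
    block_scores = []
    for start in range(0, n, block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        block_scores.append((dot(query, mean_key), start))

    # Keep the top_k blocks and expand them back to token positions.
    chosen = sorted(block_scores, reverse=True)[:top_k]
    selected = sorted(
        i for _, start in chosen for i in range(start, min(start + block_size, n))
    )

    # Exact softmax attention, restricted to the selected, uncompressed KV pairs.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    max_logit = max(logits)
    weights = [math.exp(l - max_logit) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]
    return output, selected
```

Because the selected keys and values are the original cached entries, attention inside the chosen blocks is exact, and the only approximation is which blocks get skipped. When `top_k` covers every block, the function reduces to ordinary full attention for that query.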
| Property | M2 (full attention) | M3 (MiniMax Sparse Attention) |
|---|---|---|
| Attention backbone | Full attention (GQA) | GQA + block-level KV selection |
| KV cache | Uncompressed | Uncompressed (unlike DeepSeek's MLA) |
| Total parameters | 229.9B | Not disclosed |
| Active parameters / token | 9.8B (256 MoE experts) | Not disclosed |
| Prefilling at 1M tokens | Baseline | 9.7× faster |
| Decoding at 1M tokens | Baseline | 15.6× faster |
| Prefix caching | Yes | Yes (native) |
| Multi-hop reasoning | Preserved | Preserved (per MiniMax claim) |
Table 2 compares both generations. The decoding row is the headline: a 15.6× speedup at one million tokens. Decoding is the bottleneck of long text generation because, for each new token, the model must attend over the keys and values of every token that came before it. A speedup of this magnitude would make long agentic outputs, such as multi-step task results and multi-page summaries, economically viable to generate in real time.
Forge and the self-improving M2.7
The M2 report also reveals the architecture of MiniMax's agent training system. The company built "Forge", a reinforcement learning (RL) environment split into three independent modules: the Agent Side, a middleware abstraction layer (Gateway Server and Data Pool), and the Training/Inference engines.
Two engineering solutions inside Forge stand out: windowed FIFO scheduling (a sliding-window scheduler that prevents cluster idle time and gradient oscillation) and prefix tree merging (grouping completions that share identical prefixes into a single forward pass, yielding up to 40× training speedup with zero approximation error).
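The saving from prefix tree merging can be illustrated with a small Python sketch that counts how many tokens need a forward computation with and without merging. The function name `count_forward_tokens` and the toy token IDs are hypothetical, and this is not Forge's code: it only counts tokens, whereas a real training pass would also need an attention mask that lets each branch see only its own ancestors in the tree.

```python
def count_forward_tokens(sequences):
    """Return (naive, merged) token counts for a batch of token-id sequences.

    naive:  every sequence is processed in full.
    merged: sequences are inserted into a prefix tree, so a token on a
            shared prefix is counted only once.
    """
    naive = sum(len(seq) for seq in sequences)

    root = {}
    merged = 0
    for seq in sequences:
        node = root
        for token in seq:
            if token not in node:
                node[token] = {}
                merged += 1  # a new tree node is a token that must be computed
            node = node[token]
    return naive, merged


# Three completions sampled from the same four-token prompt.
prompt = [1, 2, 3, 4]
batch = [prompt + [5, 6], prompt + [5, 7], prompt + [8]]
naive_tokens, merged_tokens = count_forward_tokens(batch)
```

In this toy batch the naive count is 17 tokens and the merged count is 8. The ratio depends entirely on how much prefix the completions share, which is why a long shared prompt with many sampled completions is the case where a large speedup is plausible.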
Forge training produced MiniMax M2.7, which MiniMax says autonomously handles 30–50% of its own ML development pipeline. On MLE Bench Lite, a benchmark that tests autonomous machine learning research capability, M2.7 scored a 66.6% medal rate across 24-hour independent trials, effectively matching Google's closed-weight Gemini 3.1 Pro. In the open-source market, MiniMax competes with Xiaomi for dominance in the agentic model segment.
Why this matters
MiniMax M3 matters for reasons that go beyond a single benchmark. If MSA genuinely delivers a 15.6× decoding speedup without degrading multi-hop reasoning, it breaks a tradeoff that has constrained agentic applications at long contexts for years. The cost of inference at one million tokens would drop dramatically, making agentic infrastructure viable for enterprises that currently cannot afford full attention at long sequences.
The M2 report is also a rare example of technical transparency from a Chinese AI vendor: it documents not only successes but also dead ends, including rejected sub-quadratic architectures and expert load-balancing problems. For AI engineers building their own models, it is a free roadmap for avoiding costly mistakes. Competitively, a successful M3 would strengthen MiniMax's position in the open-source agentic model segment.
What's next
- MiniMax announced a technical blog post detailing MSA — its publication will be the first real test of the claimed speedup numbers
- The M2 technical report is already available on Hugging Face — developers can independently verify the described Forge and MTP results
- Full M3 launch has no announced date; the company teased "Something BIG is coming" — a concrete timeline will show whether the schedule matches the architectural ambitions
Sources
- VentureBeat — MiniMax teases upcoming M3 model with new sparse attention mechanism and 15.6X long-context response speed boost
- Hugging Face Papers — MiniMax M2 Technical Report




