Tag: Multi token prediction

News · May 28, 2026

MiniMax M3: sparse attention architecture and 15.6× faster decoding

MiniMax published a technical report on its M2 series and announced M3, a model with a new sparse attention mechanism (MSA) that decodes 15.6 times faster than M2 at one-million-token context lengths. According to the company, it is the first sub-quadratic architecture to preserve multi-hop reasoning without compromise.

News · May 6, 2026

Google boosts Gemma 4 inference up to 3x with speculative decoding

On May 6, 2026, Google released experimental Multi-Token Prediction (MTP) drafter models for the Gemma 4 family, accelerating local inference up to three times with no loss of output quality. The technique is based on speculative decoding: a lightweight draft model predicts future tokens, which are then verified in parallel by the main model.
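
The draft-then-verify loop described above can be illustrated with a minimal sketch. It is not Google's implementation and does not use the Gemma 4 drafters: `toy_target`, `toy_draft`, and `speculative_decode` are hypothetical names, both "models" are simple deterministic functions, and decoding is assumed to be greedy. A real system would verify all drafted positions in one batched forward pass of the main model, whereas this sketch uses a plain loop for clarity.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the large main model."""
    return (ctx[-1] * 3 + len(ctx)) % 11


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical cheap drafter: agrees with the target except at every 4th position."""
    guess = toy_target(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 11


def greedy_decode(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding: draft k tokens, keep the prefix the target agrees with."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The drafter proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks each drafted position (batched in a real system).
        accepted: List[Token] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            accepted.append(expected)  # the target's own token is always kept
            if expected != proposal[i]:
                break  # first mismatch: discard the remaining drafted tokens
        else:
            # Every drafted token matched, so the target also yields one extra token.
            accepted.append(target(out + proposal))

        out.extend(accepted)
    return out[:goal]


if __name__ == "__main__":
    prompt = [1, 2]
    print(speculative_decode(toy_target, toy_draft, prompt, max_new=20))
    print(greedy_decode(toy_target, prompt, max_new=20))
```

Because every emitted token is the one the target itself would have chosen, the output in this greedy setting matches plain decoding with the target alone. The speedup comes only from how many drafted tokens are accepted per verification round.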