Standard transformer attention has quadratic complexity, O(n²), in the context length n, which makes a native 1M-token window practically infeasible: at n = 10⁶, each attention layer must score about 10¹² query-key pairs. Sparse-attention approximations have existed in the literature for years (Longformer, BigBird, Sparse Transformers), but they have rarely been used to train frontier models. LongCat Sparse Attention shows that a sparse variant can be used in full-scale pre-training of a 1.6T-parameter model without quality loss.
The mechanism was first presented together with the LongCat-2.0 model (December 2025), a 1.6T-parameter MoE model with ~48B active parameters per token and a 1M-token context, trained entirely on Chinese AI ASIC superpods.
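The quadratic-cost figure quoted above can be checked with a few lines of arithmetic. The sketch below counts query-key pairs for dense attention and, for comparison, for a generic local sliding window in the style of Longformer. The sliding window is only a stand-in: the text does not describe the actual sparsity pattern of LongCat Sparse Attention, and the window size of 2048 is an arbitrary illustrative value. The function names are hypothetical.

```python
def dense_score_count(n: int) -> int:
    """Query-key pairs scored by full (non-causal) attention over n tokens."""
    return n * n


def sliding_window_score_count(n: int, window: int) -> int:
    """Query-key pairs when each query attends to itself plus at most
    `window` tokens on each side (a generic local pattern, NOT the
    LongCat Sparse Attention pattern, which is not specified here)."""
    w = min(window, n - 1)  # a window wider than the sequence is just dense
    # Interior tokens see 2w + 1 keys; the w tokens at each edge see fewer.
    return n * (2 * w + 1) - w * (w + 1)


if __name__ == "__main__":
    n = 1_000_000   # 1M-token context, as in the text
    window = 2048   # assumption: illustrative value only

    dense = dense_score_count(n)
    sparse = sliding_window_score_count(n, window)

    print(f"dense pairs:  {dense:.3e}")
    print(f"sparse pairs: {sparse:.3e}")
    print(f"reduction:    {dense / sparse:.1f}x")
```

The point of the comparison is the scaling: the dense count grows with n², while the windowed count grows roughly linearly in n for a fixed window, which is what makes very long contexts tractable in principle.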