Architecture

LongCat Sparse Attention

2025 · Active · Published: 29 June 2026 · Updated: 29 June 2026

Key innovation

A sparse-attention variant designed by Meituan for long contexts. It enables training and inference of models with a 1M-token context window without the quadratic memory blow-up of dense attention. It was introduced in LongCat-2.0 (December 2025).
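To make the "quadratic memory blow-up" concrete, the following back-of-the-envelope sketch compares the size of a fully materialized attention-score matrix with a version in which each query keeps scores for only a fixed number of keys. The function names, the 2-byte score size (as in bf16) and the 4096-key budget are illustrative assumptions, not figures from the LongCat-2.0 release, and the estimate covers a single head in a single layer of a naive implementation that stores the whole score matrix.

```python
def dense_score_bytes(n_tokens: int, bytes_per_score: int = 2) -> int:
    """Bytes needed to store one full n x n attention-score matrix."""
    return n_tokens * n_tokens * bytes_per_score


def windowed_score_bytes(n_tokens: int, keys_per_query: int,
                         bytes_per_score: int = 2) -> int:
    """Bytes needed if every query keeps scores for at most `keys_per_query` keys."""
    return n_tokens * min(keys_per_query, n_tokens) * bytes_per_score


n = 1_000_000                             # 1M-token context
dense = dense_score_bytes(n)              # grows with n squared
sparse = windowed_score_bytes(n, 4096)    # hypothetical per-query key budget

print(f"dense : {dense / 1e12:.2f} TB per head per layer")
print(f"sparse: {sparse / 1e9:.2f} GB per head per layer")
```

Under these assumptions the dense matrix needs 2 TB while the budgeted variant needs about 8.19 GB, because the second figure grows linearly rather than quadratically with context length.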

Problem solved

Standard transformer attention has quadratic complexity, O(n²), in the context length n, which makes a native 1M-token window practically infeasible: on the order of 10¹² query-key pairs must be scored per attention head in each layer. Sparse-attention approximations have existed in the literature for years (Longformer, BigBird, Sparse Transformers), but they have rarely been used to train frontier models. According to the LongCat team, LongCat Sparse Attention shows that a sparse variant can be used in full-scale pre-training of a 1.6T-parameter model without quality loss.
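The exact sparsity pattern of LongCat Sparse Attention is not described in this entry, so the sketch below uses a different, well-known technique instead: a causal sliding window in the style of Longformer's local attention. It is meant only to illustrate how restricting each query to a bounded set of keys turns the n² pair count into roughly n × window. The function name `local_causal_attention` and the toy inputs are hypothetical.

```python
import math


def local_causal_attention(q, k, v, window):
    """Causal sliding-window attention on plain Python lists.

    Query i attends only to keys j with i - window < j <= i.
    Returns (outputs, number_of_scored_query_key_pairs).
    """
    n, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)
    outputs, scored_pairs = [], 0
    for i in range(n):
        lo = max(0, i - window + 1)            # first key visible to query i
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j]))
                  for j in range(lo, i + 1)]
        scored_pairs += len(scores)
        top = max(scores)                      # subtract max for a stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[lo + t][c] for t, w in enumerate(weights)) / total
            for c in range(len(v[0]))
        ])
    return outputs, scored_pairs


# Toy data: 8 tokens, 2-dimensional heads (illustrative values only).
q = [[0.1 * i, 1.0] for i in range(8)]
k = [[1.0, 0.1 * i] for i in range(8)]
v = [[float(i), float(-i)] for i in range(8)]

_, local_pairs = local_causal_attention(q, k, v, window=3)
_, dense_pairs = local_causal_attention(q, k, v, window=8)
print(local_pairs, dense_pairs)  # 21 scored pairs vs 36 for full causal attention
```

With a window equal to the sequence length the function reduces to ordinary causal attention, which makes it easy to compare the two pair counts on the same inputs.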

Implementation

Reference implementations

meituan-longcat/LongCat-2.0 (HuggingFace)

Python (transformers / pytorch) · Meituan LongCat Team

Official

Evolution

Original paper · 2025 · LongCat Tech Blog (December 2025); a detailed paper has not yet been released on arXiv (as of December 2025). · Meituan LongCat Team

Introducing LongCat-2.0 (LongCat Tech Blog)

Meituan LongCat Team

2025

Introduction of LongCat Sparse Attention in LongCat-2.0

Inflection point

The mechanism was first presented together with the LongCat-2.0 model (December 2025). LongCat-2.0 is a 1.6T-parameter MoE model with ~48B active parameters per token and a 1M-token context, trained entirely on Chinese AI ASIC superpods.

Introducing LongCat-2.0 (paper)

Sources

Introducing LongCat-2.0 (LongCat Tech Blog)

Blog

Meituan LongCat

meituan-longcat/LongCat-2.0 (HuggingFace model card)

Repository

HuggingFace