Robotics

MSAT

2026 · Experimental · Published

Key innovation

Integrates heterogeneous robotic modalities (vision, language, proprioception, tactile sensing, motor signals) as separate modality-specific token streams inside a single transformer, fused via cross-modal joint self-attention. This lets a vision-language-action (VLA) policy learn broad scene understanding together with narrow functional capabilities (motion awareness, long-term memory, physical sensing) in one model, without the compromises of a hand-engineered multi-stage pipeline.
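To make the idea of joint self-attention over modality-specific streams concrete, here is a minimal NumPy sketch. It is not MSAT's actual implementation. It assumes each modality keeps its own query/key/value/output projection weights, and that all tokens are concatenated so every token can attend to every other token across modalities. The function name, token counts, model width, and the single-head, no-normalization layout are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(streams, params):
    """Single-head joint attention over several token streams (illustrative).

    streams: dict name -> (n_tokens, d) array
    params:  dict name -> dict with "Wq", "Wk", "Wv", "Wo", each (d, d)
    """
    names = list(streams)
    # Modality-specific projections: each stream uses its own weights.
    q = np.concatenate([streams[n] @ params[n]["Wq"] for n in names])
    k = np.concatenate([streams[n] @ params[n]["Wk"] for n in names])
    v = np.concatenate([streams[n] @ params[n]["Wv"] for n in names])
    # Joint attention: every token attends to all tokens of all modalities.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    mixed = weights @ v
    # Split the result back into streams and apply per-stream output weights.
    out, start = {}, 0
    for n in names:
        stop = start + streams[n].shape[0]
        out[n] = mixed[start:stop] @ params[n]["Wo"]
        start = stop
    return out, weights

# Hypothetical token counts per modality and model width.
rng = np.random.default_rng(0)
d_model = 16
token_counts = {"vision": 6, "language": 4, "proprioception": 1,
                "tactile": 2, "motor": 3}
streams = {n: rng.normal(size=(t, d_model)) for n, t in token_counts.items()}
params = {
    n: {w: rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        for w in ("Wq", "Wk", "Wv", "Wo")}
    for n in token_counts
}

out, weights = joint_self_attention(streams, params)
```

Each entry of `out` keeps the token count of its input stream, while `weights` is one square matrix over all tokens, which is where the cross-modal mixing happens in this sketch.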

How it works

The architecture maintains multiple processing streams (e.g., visual tokens, state tokens, action tokens), each processed with stream-specific attention patterns. Cross-stream attention modules fuse information between streams at designated layers.
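One possible way to picture "stream-specific attention with fusion at designated layers" is a per-layer attention mask: block-diagonal in stream-local layers, fully open in fusion layers. The NumPy sketch below illustrates that reading only. The mask-based fusion, the layer schedule in `FUSION_LAYERS`, the token counts, and the weight-free attention are assumptions, not details taken from MSAT.

```python
import numpy as np

def stream_ids(token_counts):
    """Label each token position with the index of the stream it belongs to."""
    return np.concatenate(
        [np.full(n, i) for i, n in enumerate(token_counts.values())]
    )

def attention_mask(ids, fuse):
    """True where the query token (row) may attend to the key token (column)."""
    if fuse:
        return np.ones((ids.size, ids.size), dtype=bool)  # all streams visible
    return ids[:, None] == ids[None, :]                   # same stream only

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def masked_attention(x, mask):
    """Weight-free dot-product attention, restricted by a boolean mask."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # blocked pairs get zero weight
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x, w

# Hypothetical configuration: three streams, fusion at layers 1 and 3.
token_counts = {"visual": 5, "state": 2, "action": 3}
NUM_LAYERS = 4
FUSION_LAYERS = {1, 3}

ids = stream_ids(token_counts)
rng = np.random.default_rng(0)
x = rng.normal(size=(ids.size, 8))
off_block = ids[:, None] != ids[None, :]  # token pairs from different streams

cross_stream_weight = []
for layer in range(NUM_LAYERS):
    mask = attention_mask(ids, fuse=layer in FUSION_LAYERS)
    update, w = masked_attention(layer_norm(x), mask)
    x = x + update  # residual connection
    # Total attention weight flowing between different streams in this layer.
    cross_stream_weight.append(float(w[off_block].sum()))
```

In this sketch `cross_stream_weight` is zero for the stream-local layers and positive for the fusion layers, so information crosses stream boundaries only where the schedule allows it. A dedicated cross-attention module per pair of streams would be another way to realize the same description.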

Problem solved

Standard transformers process all tokens in a single stream, which is suboptimal for robot control, where visual, proprioceptive, and action tokens have very different structures. MSAT addresses this with specialized parallel streams.