The architecture maintains multiple processing streams (e.g., visual tokens, state tokens, action tokens), each processed with stream-specific attention patterns. Cross-stream attention modules fuse information between streams at designated layers.
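The two mechanisms above, per-stream attention and cross-stream fusion at designated layers, can be sketched in a few lines of plain Python. This is a minimal illustration and not MSAT's actual implementation. It uses a single head with no learned projections, and the mask choices (`PATTERNS`, with a causal mask on action tokens), the `forward` helper, and the choice of which layer indices perform fusion are all hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values, mask=None):
    """Scaled dot-product attention; mask(i, j) says whether query i may see key j."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            s = sum(a * b for a, b in zip(q, k)) / scale
            if mask is not None and not mask(i, j):
                s = float("-inf")  # masked keys receive zero weight
            scores.append(s)
        w = softmax(scores)
        dim = len(values[0])
        out.append([sum(w[j] * values[j][c] for j in range(len(values)))
                    for c in range(dim)])
    return out

def add(xs, ys):
    """Residual connection: element-wise sum of two token lists."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

# Assumed stream-specific patterns: full attention for visual and state
# tokens, causal attention for action tokens. Illustrative only.
PATTERNS = {
    "visual": None,
    "state": None,
    "action": lambda i, j: j <= i,
}

def forward(streams, num_layers=2, cross_layers=(1,)):
    """Run per-stream self-attention, fusing streams at `cross_layers`."""
    for layer in range(num_layers):
        # 1) Each stream attends only to itself, with its own mask.
        streams = {
            name: add(toks, attend(toks, toks, toks, PATTERNS.get(name)))
            for name, toks in streams.items()
        }
        # 2) At designated layers, each stream queries all other streams.
        if layer in cross_layers:
            fused = {}
            for name, toks in streams.items():
                others = [t for other, ts in streams.items()
                          if other != name for t in ts]
                fused[name] = add(toks, attend(toks, others, others))
            streams = fused
    return streams

streams = {
    "visual": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "state": [[0.5, -0.5]],
    "action": [[0.2, 0.1], [0.3, -0.4]],
}
out = forward(streams)
print({name: len(toks) for name, toks in out.items()})
```

Between fusion layers the streams do not interact in this sketch, so a change to the visual tokens can reach the action tokens only through a cross-stream step.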
Standard transformers process all tokens in a single stream. This is a poor fit for robot control, where visual, proprioceptive (state), and action tokens differ substantially in structure. MSAT addresses this by giving each token type its own specialized stream, with the streams processed in parallel.