
TML-Interaction-Small

Research preview
Natively interactive full-duplex 276B MoE model (12B active) from Thinking Machines Lab; processes audio, video, and text in 200 ms micro-turns.
Tags: Preview · Limited access · Multimodal · Audio · Specialized AI
Parameters: 276B (12B active, MoE)
Release date: 11 May 2026
Access: API · Deployment: Cloud

Overview

TML-Interaction-Small is an interaction model unveiled on 11 May 2026 by Thinking Machines Lab as a research preview. It uses a Mixture-of-Experts (MoE) architecture with 276B total parameters and 12B active parameters. The model processes continuous audio, video, and text streams in 200 ms micro-turns, generating text and audio concurrently without an external voice-activity-detection component.
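
A minimal sketch of what such a 200 ms micro-turn loop could look like, in Python. Everything here (the `step()` interface, the I/O objects, the output shape) is a hypothetical assumption for illustration; the preview API is not documented in this card. The structural point is that every tick the model both ingests fresh audio/video/text and may emit text and audio, so no voice-activity detector gates the exchange.

```python
import time
from dataclasses import dataclass

MICRO_TURN_S = 0.200  # 200 ms micro-turn budget, per the release notes

@dataclass
class MicroTurnOutput:
    text: str = ""      # text tokens emitted this tick (may be empty)
    audio: bytes = b""  # audio tokens emitted this tick (may be empty)

class InteractionModelStub:
    """Stand-in for the model; replace step() with the real inference call."""
    def step(self, audio_in: bytes, video_in: bytes, text_in: str) -> MicroTurnOutput:
        # A real model would fold the new input tokens into its context and
        # decode this micro-turn's output tokens in the same forward pass.
        return MicroTurnOutput()

def run_session(model, mic, camera, keyboard, speaker, display, ticks=10):
    """Drive one full-duplex session: ingest and emit on every tick."""
    for _ in range(ticks):
        t0 = time.monotonic()
        out = model.step(mic.read(), camera.read(), keyboard.read())
        if out.text:
            display.write(out.text)
        if out.audio:
            speaker.write(out.audio)  # speech can start while input continues
        # Sleep out whatever remains of the 200 ms budget.
        time.sleep(max(0.0, MICRO_TURN_S - (time.monotonic() - t0)))
```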

The architecture uses encoder-free early fusion: audio is represented as dMel (discretised mel-spectrogram) features, images are split into 40×40 pixel patches embedded by an hMLP stem, and the audio decoder uses a flow head. All components are co-trained from scratch together with the transformer backbone. The model reaches 0.40 s turn-taking latency on FD-bench V1 and scores 43.4% on Audio MultiChallenge APR. The system pairs with an asynchronous background model that handles longer reasoning and tool use.
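
A hedged sketch of that early-fusion front end, under stated assumptions: the card gives only the 40×40 patch size, so the model width, mel channel count, quantisation levels, the per-frame embedding sum, and the plain two-layer MLP standing in for the hMLP stem below are all illustrative guesses, not the published configuration.

```python
import torch
import torch.nn as nn

D_MODEL = 1024             # assumed transformer width
PATCH = 40                 # 40x40 pixel patches (from the model card)
N_MELS, MEL_BINS = 80, 16  # assumed dMel config: 80 mel channels, 16 levels

class HMLPPatchEmbed(nn.Module):
    """Patch stem: flatten each 40x40 RGB patch, then a small MLP."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(3 * PATCH * PATCH, D_MODEL), nn.GELU(),
            nn.LayerNorm(D_MODEL), nn.Linear(D_MODEL, D_MODEL),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (B, 3, H, W) with H and W divisible by 40
        b, c, h, w = img.shape
        patches = img.unfold(2, PATCH, PATCH).unfold(3, PATCH, PATCH)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * PATCH * PATCH)
        return self.proj(patches)  # (B, n_patches, D_MODEL)

class DMelEmbed(nn.Module):
    """dMel-style audio tokens: quantise each mel channel, embed, sum per frame."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_MELS * MEL_BINS, D_MODEL)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (B, T, N_MELS) log-mel features scaled to [0, 1)
        levels = (mel.clamp(0, 1 - 1e-6) * MEL_BINS).long()       # per-channel bin
        offsets = torch.arange(N_MELS, device=mel.device) * MEL_BINS
        return self.embed(levels + offsets).sum(dim=2)            # (B, T, D_MODEL)
```

Because both embedders map straight into the transformer's token stream, no pretrained audio or vision encoder sits in front of the backbone, which is what "encoder-free early fusion" implies here.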

Classification
Multimodal · Audio · Specialized AI
Access & deployment
Access: API · Deployment: Cloud · Weights: Closed
Key parameters
Parameters: 276B (12B active, MoE)
Modalities
Input: text, audio, video

Technical specification

Parameters: 276B (12B active, MoE)
Features: Tool use
Modalities
Input: text, audio, video
Output: text, audio

Capabilities and applications

Native model capabilities
Voice conversation
Ability to conduct multi-turn real-time voice conversations with context retention and natural speech pacing.
Category: speech
Speech-to-text
Category: speech
Text-to-speech
Category: speech
Streaming speech-to-text
Real-time conversion of speech to text, with output emitted while the speaker is still talking.
Category: speech
Live translation
Real-time speech translation between multiple languages without interrupting the audio stream.
Category: speech
Audio understanding
Category: audio
Video understanding
Category: video
Multimodal understanding
Category: multimodal
Streaming output
Category: reasoning
Function calling (see the sketch after this list)
Category: planning
Multilingual
Category: language
Reasoning
Category: reasoning
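
Function calling has to coexist with the full-duplex audio stream. One plausible pattern, sketched below with entirely hypothetical token shapes (the card documents none), is for tool calls to appear as marker tokens in the output stream, with results re-injected as text on a later micro-turn while speech continues; the asynchronous background model mentioned in the overview would presumably absorb the longer tool runs.

```python
import json

def handle_micro_turn(out_tokens: list[dict], tools: dict, pending_text: list[str]):
    """Dispatch one micro-turn's output tokens; returns audio chunks to play.

    Token dicts like {"type": "tool_call", "name": ..., "arguments": ...}
    are an assumed wire format, not the documented API.
    """
    audio_chunks = []
    for tok in out_tokens:
        if tok["type"] == "audio":
            audio_chunks.append(tok["data"])       # keep speaking
        elif tok["type"] == "tool_call":           # assumed marker token
            result = tools[tok["name"]](**tok["arguments"])
            # Result re-enters the stream as text on the next micro-turn,
            # so the spoken channel never has to pause for the tool.
            pending_text.append(json.dumps({"tool_result": result}))
        elif tok["type"] == "text":
            print(tok["data"], end="")
    return audio_chunks
```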

Benchmark results

13 benchmarks. All results are from the Thinking Machines Lab blog (May 2026).

FD-bench V1 (turn-taking latency): 0.40 s
FD-bench V1.5 (average quality): 77.8 points
FD-bench V3 (response quality): 82.8%
Audio MultiChallenge (APR): 43.4%
BigBench Audio (accuracy): 75.7%
IFEval, VoiceBench (accuracy): 82.1%
IFEval, text (accuracy): 89.7%
HarmBench (refusal rate): 99.0%
TimeSpeak, internal (macro accuracy): 64.7%
CueSpeak, internal (macro accuracy): 81.7%
RepCount-A (off-by-one): 35.4%
ProactiveVideoQA (PAUC@ω=0.5): 33.5 points
Charades (mIoU): 32.4 points

Technical architecture