TRACE

2026 · Research · Published
Key innovation
Trajectory-level reward penalizing tool-call count, with an efficiency threshold tightened monotonically across training epochs — forcing the model to learn progressively shorter trajectories for the same task.
Category
Training
Abstraction level
Building block
Operation level
Post-training, Training
Use cases
Training multimodal agents with multiple tool calls (HyperEyes)
Optimizing inference cost in production agents
Training reasoning models that limit Chain-of-Thought length
RL for tool-augmented agents (web search, code interpreter)
Reinforcement Fine-Tuning with compute budget constraints

How it works

After each rollout τ:
(1) Evaluate accuracy: acc ∈ {0,1}.
(2) Count tool calls: t_s = TurnCount(τ).
(3) Compute the reward: if acc = 1, R_trace = R⁺ − λ_t · t_s (correctness bonus minus a per-call penalty); if acc = 0, R_trace = R⁻ (constant penalty).
(4) At the end of each epoch, update the threshold: l_t becomes the minimum of the current l_t and the smallest step count among that epoch's correct trajectories.

This "raising the bar" mechanism progressively tightens the efficiency criterion, so the model cannot settle at the efficiency level it has already reached.
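The per-rollout reward in step (3) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Rollout` class, the function name `trace_reward`, and the numeric values for R⁺, R⁻ and λ_t are all assumptions chosen for the example. The source does not say how the threshold l_t enters the reward, so the sketch leaves it out.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical container for the two quantities TRACE needs per rollout."""
    correct: bool    # acc in {0, 1}
    tool_calls: int  # t_s = TurnCount(tau)


def trace_reward(rollout, r_pos=1.0, r_neg=-1.0, lam=0.05):
    """Return R_trace for one rollout.

    r_pos, r_neg and lam stand in for R+, R- and lambda_t; the default
    values are illustrative assumptions, not values from the source.
    """
    if rollout.correct:
        # Correctness bonus minus a penalty proportional to tool-call count.
        return r_pos - lam * rollout.tool_calls
    # Incorrect answers receive the same constant penalty regardless of length.
    return r_neg


batch = [Rollout(True, 3), Rollout(True, 8), Rollout(False, 2)]
rewards = [trace_reward(r) for r in batch]
```

Under these assumed values, the shorter correct rollout earns a higher reward than the longer correct one, and both score above the incorrect rollout.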

Problem solved

Classic reward shaping in RL rewards only answer correctness, so there is no signal limiting the tool-call count. The agent may learn to answer correctly only after an excessive number of tool-call rounds, which drives up inference cost in production. TRACE makes efficiency a first-class training objective, and the adaptive threshold keeps optimization from stagnating.

Components

Trajectory accuracy evaluator

Function evaluating final answer correctness after full rollout (binary 0/1).

Tool-call counter

Counts the tool calls in a trajectory, t_s = TurnCount(τ). This count directly sets the size of the penalty.

Adaptive threshold l_t

Efficiency threshold tightened monotonically across epochs: l_t ← min(l_t, min(T_tol)), where min(T_tol) is the smallest tool-call count among the epoch's correct trajectories.
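The update rule can be sketched as a small function. The name `update_threshold` and the `(correct, tool_calls)` tuple format are assumptions for illustration. Keeping l_t unchanged when an epoch has no correct trajectory is also an assumption, since the source does not cover that case.

```python
def update_threshold(l_t, rollouts):
    """Apply l_t <- min(l_t, min(T_tol)) at the end of an epoch.

    rollouts: iterable of (correct, tool_calls) pairs from one epoch.
    """
    # T_tol: tool-call counts of the correct trajectories only.
    correct_counts = [calls for correct, calls in rollouts if correct]
    if not correct_counts:
        # Assumption: with no correct trajectory, the threshold stays put.
        return l_t
    # min() guarantees the threshold never increases.
    return min(l_t, min(correct_counts))


epoch_rollouts = [(True, 4), (False, 1), (True, 7)]
new_l_t = update_threshold(6, epoch_rollouts)
```

The incorrect rollout with a single tool call is ignored, so only correct trajectories can tighten the threshold.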

Penalty coefficient λ_t

Weight of the per-tool-call penalty. It sets the trade-off between accuracy and efficiency.

Implementation

Reference implementations
Implementation pitfalls
Overly aggressive threshold tightening can halt learning (Medium)

If l_t drops too fast, the model may not have time to learn alternative shorter trajectories and can stagnate with a non-zero penalty. A warm-up period for l_t is required.
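One possible way to add such a warm-up is sketched below. The source does not describe a specific schedule, so everything here is an assumption: the function name, the `warmup_epochs` parameter (epochs during which l_t is frozen), and the `max_drop` cap on how far l_t may fall in one epoch.

```python
def update_threshold_with_warmup(l_t, correct_counts, epoch,
                                 warmup_epochs=3, max_drop=1):
    """Hypothetical damped variant of l_t <- min(l_t, min(T_tol)).

    correct_counts: tool-call counts of this epoch's correct trajectories.
    epoch: zero-based epoch index.
    """
    # During warm-up, or with no correct trajectory, leave l_t unchanged.
    if epoch < warmup_epochs or not correct_counts:
        return l_t
    # The plain rule would jump straight to this target.
    target = min(l_t, min(correct_counts))
    # Cap the per-epoch decrease so the bar rises gradually.
    return max(target, l_t - max_drop)
```

With the defaults, a threshold of 10 and a best correct trajectory of 4 tool calls, l_t stays at 10 during the first three epochs and then falls by one per epoch instead of jumping to 4. The threshold still never increases.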

Sensitivity to λ_t selection (Medium)

If λ_t is too small, the tool-call penalty is negligible and the model does not minimize steps. If it is too large, the model avoids tools even when they are necessary, and accuracy drops.
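The sensitivity follows from the reward formula: for two correct trajectories, R⁺ cancels and the reward difference is λ_t times the difference in tool-call counts. A short sketch makes the scaling concrete. The function name `reward_gap` and the three λ values are illustrative assumptions.

```python
def reward_gap(lam, short_calls, long_calls):
    """Reward advantage of the shorter of two correct trajectories.

    (R+ - lam * short_calls) - (R+ - lam * long_calls)
    simplifies to lam * (long_calls - short_calls).
    """
    return lam * (long_calls - short_calls)


# Compare a 3-call and an 8-call correct trajectory under different weights.
for lam in (0.001, 0.05, 0.5):
    print(lam, reward_gap(lam, short_calls=3, long_calls=8))
```

At the smallest weight, five extra tool calls barely change the reward, so the signal to shorten trajectories is weak. At the largest weight, the same five calls cost more than a typical correctness bonus of 1.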

Reward hacking: model learns to refuse hard queries (Medium)

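With a very high tool-call penalty, the model may prefer a quick incorrect answer (R⁻) over a long correct one. To prevent this, R⁻ must be clearly lower than the reward of any correct trajectory, that is, R⁻ < R⁺ − λ_t · t_s for the largest tool-call count t_s the agent is allowed to reach.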
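The condition can be checked before training. The sketch below assumes a hard cap on tool calls per trajectory, called `max_calls` here. That cap, the function names, and the example values are assumptions, not part of the source.

```python
def max_safe_lambda(r_pos, r_neg, max_calls):
    """Strict upper bound on lambda_t so every correct trajectory beats R-.

    Solves r_pos - lam * max_calls > r_neg for lam.
    """
    if max_calls <= 0:
        raise ValueError("max_calls must be positive")
    return (r_pos - r_neg) / max_calls


def is_safe(r_pos, r_neg, lam, max_calls):
    """True if the longest allowed correct trajectory still beats R-."""
    return r_pos - lam * max_calls > r_neg


bound = max_safe_lambda(r_pos=1.0, r_neg=-1.0, max_calls=10)
```

With R⁺ = 1, R⁻ = −1 and at most 10 tool calls, λ_t must stay below 0.2. A value of 0.05 satisfies the condition, while 0.5 would make a 10-call correct answer score worse than an immediate wrong one.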

Execution paradigm

Primary mode
Sparse
Activation pattern
Stage dependent

Parallelism

Parallelism level
Fully parallel
Scope
Training, Across devices

Hardware requirements

As part of the RL pipeline, TRACE requires GPUs for parallel agent rollouts: many trajectories must be generated at once for stable policy-gradient estimation.