Training

TRACE

2026ResearchPublished

Key innovation

Trajectory-level reward penalizing tool-call count, with an efficiency threshold tightened monotonically across training epochs — forcing the model to learn progressively shorter trajectories for the same task.

How it works

After each rollout τ: (1) accuracy evaluation acc ∈ {0,1}; (2) tool-call count t_s = TurnCount(τ); (3) if acc==1: R_trace = R⁺ − λ_t · t_s (correctness bonus minus per-call penalty); if acc==0: R_trace = R⁻ (constant penalty); (4) at epoch end the threshold l_t updates to min of current l_t and minimum step count among correct trajectories in the epoch. The "raising the bar" mechanism progressively tightens the efficiency criterion — the model cannot rest on its achieved level.

Problem solved

Classic reward shaping in RL rewards only answer correctness — there is no signal limiting tool-call count. The agent may learn to answer correctly but after an excessive number of rounds, generating high inference costs in production. TRACE introduces efficiency as a first-class training objective, and the adaptive threshold prevents optimization stagnation.

Components

Trajectory accuracy evaluator

Function evaluating final answer correctness after full rollout (binary 0/1).

Tool-call counter

Tool-call counter in the trajectory — directly affects the penalty.

Adaptive threshold l_t

Efficiency threshold tightened monotonically across epochs: l_t ← min(l_t, min(T_tol)).

Penalty coefficient λ_t

Weight coefficient of per-tool-call penalty — calibrates accuracy vs. efficiency balance.

Implementation

Reference implementations

HyperEyes

Python

Official

Implementation pitfalls

Overly aggressive threshold tightening can halt learningMedium

If l_t drops too fast the model may fail to learn alternative shorter trajectories and stagnate on non-zero penalty. Requires warm-up for l_t.

Sensitivity to λ_t selectionMedium

Too small λ_t → tool-call penalty is negligible, model does not minimize steps. Too large → model avoids tools even when necessary, accuracy drops.

Reward hacking — model learns to refuse hard queriesMedium

With very high tool-call penalty the model may prefer a quick incorrect answer (R⁻) over a long correct one. Requires R⁻ to be clearly lower than any correct trajectory.