After each rollout τ: (1) evaluate accuracy, acc ∈ {0,1}; (2) count tool calls, t_s = TurnCount(τ); (3) compute the reward: if acc == 1, R_trace = R⁺ − λ_t · t_s (correctness bonus minus a per-call penalty); if acc == 0, R_trace = R⁻ (constant penalty); (4) at the end of each epoch, update the threshold l_t to the minimum of its current value and the smallest step count among the epoch's correct trajectories. This "raising the bar" mechanism progressively tightens the efficiency criterion, so the model cannot settle at the efficiency level it has already reached.
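The per-rollout reward in step (3) can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the names TraceRewardConfig and trace_reward are hypothetical, and the default values for R⁺, R⁻ and λ_t are placeholders chosen only to make the example concrete.

```python
from dataclasses import dataclass


@dataclass
class TraceRewardConfig:
    # All three defaults are illustrative assumptions, not values from the source.
    r_pos: float = 1.0   # R+ : bonus for a correct final answer
    r_neg: float = -1.0  # R- : constant penalty for an incorrect final answer
    lam: float = 0.05    # lambda_t : penalty per tool call


def trace_reward(acc: int, tool_calls: int, cfg: TraceRewardConfig) -> float:
    """Return R_trace for one rollout.

    acc        -- binary accuracy of the final answer (0 or 1)
    tool_calls -- t_s, the number of tool calls in the trajectory
    """
    if acc not in (0, 1):
        raise ValueError("acc must be 0 or 1")
    if tool_calls < 0:
        raise ValueError("tool_calls must be non-negative")
    if acc == 1:
        # Correct answer: bonus minus a penalty that grows with each tool call.
        return cfg.r_pos - cfg.lam * tool_calls
    # Incorrect answer: constant penalty, independent of the tool-call count.
    return cfg.r_neg
```

Under these assumed values, a correct rollout with fewer tool calls always scores higher than a correct rollout with more, while every incorrect rollout receives the same R⁻.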
Classic reward shaping in RL rewards only answer correctness, with no signal limiting the number of tool calls. The agent may learn to answer correctly, but only after an excessive number of rounds, which drives up inference cost in production. TRACE makes efficiency a first-class training objective, and the adaptive threshold keeps optimization from stagnating.
acc: function evaluating final-answer correctness after the full rollout (binary, 0 or 1).
t_s: tool-call counter for the trajectory; directly determines the penalty.
l_t: efficiency threshold, tightened monotonically across epochs: l_t ← min(l_t, min(T_tol)).
λ_t: weight of the per-tool-call penalty; calibrates the balance between accuracy and efficiency.
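The end-of-epoch threshold update l_t ← min(l_t, min(T_tol)) might look like the sketch below. It assumes that T_tol is the collection of step counts of the epoch's correct trajectories, and that the threshold is left unchanged when an epoch contains no correct trajectory; the source does not specify either point. The function name update_threshold is hypothetical.

```python
from typing import Iterable, Tuple


def update_threshold(l_t: float, epoch_rollouts: Iterable[Tuple[int, int]]) -> float:
    """Tighten the efficiency threshold at the end of an epoch.

    epoch_rollouts -- (acc, tool_calls) pairs, one per rollout in the epoch
    """
    # Step counts of correct trajectories only (assumed meaning of T_tol).
    correct_counts = [calls for acc, calls in epoch_rollouts if acc == 1]
    if not correct_counts:
        # Assumption: with no correct trajectory there is nothing to compare
        # against, so the threshold stays where it is.
        return l_t
    # The update can only lower l_t or leave it unchanged, never raise it.
    return min(l_t, min(correct_counts))


# Usage: carry the threshold across epochs.
l_t = 8
for rollouts in ([(1, 6), (0, 2)], [(1, 7), (1, 9)], [(0, 1)]):
    l_t = update_threshold(l_t, rollouts)
```

Because the update takes a minimum, the sequence of thresholds is non-increasing, which is what makes the tightening monotonic.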
If l_t drops too fast, the model may fail to learn alternative, shorter trajectories and stagnate with a non-zero penalty. l_t therefore needs a warm-up period.
Too small λ_t → tool-call penalty is negligible, model does not minimize steps. Too large → model avoids tools even when necessary, accuracy drops.
With a very high tool-call penalty, the model may prefer a quick incorrect answer (R⁻) over a long correct one. R⁻ must therefore be clearly lower than the reward of any correct trajectory.
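The requirement on R⁻ can be checked before training. A correct trajectory with t_s tool calls beats an incorrect one exactly when R⁺ − λ_t · t_s > R⁻. The sketch below evaluates that margin at the longest allowed trajectory; t_max, a cap on tool calls per rollout, is an assumption introduced here for illustration, as is the function name reward_margin.

```python
def reward_margin(r_pos: float, r_neg: float, lam: float, t_max: int) -> float:
    """Margin between the worst correct reward and the incorrect reward.

    The worst correct reward is the one with the most tool calls, t_max
    (an assumed cap on tool calls per rollout). A positive margin means
    every correct trajectory still scores above R-.
    """
    worst_correct = r_pos - lam * t_max
    return worst_correct - r_neg


# Illustrative values only: with a moderate penalty the margin is positive.
safe = reward_margin(r_pos=1.0, r_neg=-1.0, lam=0.25, t_max=4)

# With a large penalty and long rollouts the margin turns negative, so a
# quick incorrect answer would out-score a long correct one.
unsafe = reward_margin(r_pos=1.0, r_neg=-1.0, lam=0.5, t_max=20)
```

If the margin is negative or close to zero, either λ_t can be lowered, R⁻ made more negative, or the allowed trajectory length shortened.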
As part of the RL pipeline, TRACE requires GPUs for parallel agent rollouts: many trajectories are generated at once for stable policy-gradient estimation.