CAS
How it works
CAS = Σ(valid information returns) / Σ(tool calls), with both sums taken across all benchmark instances. A higher CAS indicates an agent that uses tools efficiently, answering correctly with few calls. The metric penalizes both incorrect answers (which contribute zero returns to the numerator) and redundant calls (which inflate the denominator).
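The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Instance` record, its field names, and the choice to return 0.0 when no tool calls were made are all assumptions introduced here, and the example numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One benchmark instance (hypothetical record layout)."""
    valid_returns: int  # assumed binary: 1 if the answer is valid, else 0
    tool_calls: int     # number of tool calls the agent made


def compute_cas(instances):
    """Sum of valid returns divided by sum of tool calls over all instances."""
    total_calls = sum(i.tool_calls for i in instances)
    if total_calls == 0:
        return 0.0  # assumption: report an undefined ratio as 0
    return sum(i.valid_returns for i in instances) / total_calls


# Invented example data: same accuracy, different call counts.
fast = [Instance(1, 3), Instance(1, 3)]      # correct after 3 calls each
slow = [Instance(1, 12), Instance(1, 12)]    # correct after 12 calls each
one_wrong = [Instance(0, 3), Instance(1, 3)] # one incorrect answer

print(compute_cas(fast), compute_cas(slow), compute_cas(one_wrong))
```

Under these assumptions the agent that answers in 3 calls scores higher than the one that needs 12, and a wrong answer lowers the score because its calls still count in the denominator.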
Problem solved
Existing agent metrics measure only accuracy: an agent that answers correctly after 12 rounds is scored the same as one that answers after 3. CAS adds an efficiency dimension, namely how much useful information the agent returns per tool call.
Implementation
The ratio of valid returns to tool calls assumes binary scoring. A partially correct answer (e.g. 4/6 entities) counts as 0, which may understate CAS for models that give good but incomplete results.
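To make the effect of binary scoring concrete, the sketch below compares it with a partial-credit variant on the 4/6 example. The partial-credit scorer is a hypothetical alternative added for contrast only and is not part of the CAS definition. The function names and the figure of 3 tool calls are likewise assumptions.

```python
def binary_score(found: int, expected: int) -> float:
    """Binary scoring: full credit only when every entity is found."""
    return 1.0 if found == expected else 0.0


def partial_score(found: int, expected: int) -> float:
    """Hypothetical partial credit: fraction of expected entities found."""
    return found / expected if expected else 0.0


def cas(scores, tool_calls):
    """Sum of scores divided by sum of tool calls (0.0 if no calls)."""
    total = sum(tool_calls)
    return sum(scores) / total if total else 0.0


# One instance: 4 of 6 entities found, using an assumed 3 tool calls.
binary_cas = cas([binary_score(4, 6)], [3])    # answer counts as 0
partial_cas = cas([partial_score(4, 6)], [3])  # answer counts as 4/6

print(binary_cas, round(partial_cas, 3))
```

With binary scoring the instance contributes nothing to the numerator, while the hypothetical partial-credit variant would credit two thirds of a return for the same three calls.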
CAS counts the number of tool calls, not their duration. An agent making 3 very slow calls may score a higher CAS than one making 6 fast calls, despite a longer total response time.
The granularity of a single tool call varies across systems: one UGS call may correspond to 6 calls in a sequential system. Without normalization for tool granularity, CAS favors architectures with coarser tools.
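The granularity bias can be illustrated with the 1-to-6 correspondence mentioned above. The normalization shown here, which weights each call by an assumed number of equivalent primitive calls, is one hypothetical way to compensate. It is not a scheme the metric defines, and the function and parameter names are invented for this sketch.

```python
def cas(valid_returns, tool_calls):
    """Sum of valid returns divided by sum of tool calls (0.0 if no calls)."""
    total = sum(tool_calls)
    return sum(valid_returns) / total if total else 0.0


def normalized_cas(valid_returns, tool_calls, primitive_calls_per_call):
    """Hypothetical normalization: convert each call to primitive-call units."""
    equivalent = [c * primitive_calls_per_call for c in tool_calls]
    return cas(valid_returns, equivalent)


# Same task, same correct answer; only the tool granularity differs.
coarse_raw = cas([1], [1])      # one coarse call
sequential_raw = cas([1], [6])  # six fine-grained calls

# Assumed weights: 1 coarse call = 6 primitive calls, 1 sequential call = 1.
coarse_norm = normalized_cas([1], [1], 6)
sequential_norm = normalized_cas([1], [6], 1)

print(coarse_raw, sequential_raw, coarse_norm, sequential_norm)
```

Without the weighting, the coarse-tool system scores six times higher for the same work. With it, both systems score the same, although choosing the weights is itself a judgment call.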