Evaluation

CAS

2026 · Research · Published
Key innovation
A metric that evaluates search agents by the ratio of valid information returns to tool calls, combining accuracy and efficiency into a single measure.
Category
Evaluation
Abstraction level
Primitive
Operation level
Evaluation (runtime)
Use cases
Evaluating search agent efficiency
Comparing parallel vs sequential search strategies
Assessing operational cost of AI agents in production
Optimizing agents for API call minimization

How it works

CAS = Σ(valid information returns) / Σ(tool calls), with both sums taken across all benchmark instances. A higher CAS indicates an agent that uses tools efficiently, answering correctly with few calls. The metric penalizes both incorrect answers (which add no valid returns to the numerator) and redundant calls (which grow the denominator).
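The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `InstanceResult` record, its field names, and the choice to return 0 when no tool calls were made are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """One benchmark instance (hypothetical record layout)."""
    valid_returns: int  # valid information returns credited on this instance
    tool_calls: int     # tool calls the agent made on this instance


def cas(results):
    """Pooled ratio: sum of valid returns divided by sum of tool calls."""
    total_calls = sum(r.tool_calls for r in results)
    if total_calls == 0:
        # Assumption: a run with no tool calls scores 0 instead of being undefined.
        return 0.0
    return sum(r.valid_returns for r in results) / total_calls


results = [
    InstanceResult(valid_returns=1, tool_calls=3),   # correct with few calls
    InstanceResult(valid_returns=0, tool_calls=5),   # incorrect: adds calls, no returns
    InstanceResult(valid_returns=1, tool_calls=12),  # correct, but many calls
]
score = cas(results)
print(f"CAS = {score:.3f}")  # CAS = 0.100
```

Because both sums run over all instances before dividing, this is a pooled ratio. It generally differs from the mean of per-instance ratios, which for the same three records would be about 0.139.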

Problem solved

Existing agent metrics measure only accuracy: an agent answering correctly after 12 rounds is treated the same as one answering after 3. CAS adds an efficiency dimension, namely how much useful information the agent returns per tool call.
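To make the 12-versus-3 contrast concrete, the sketch below scores two hypothetical agents that both answer every instance correctly. It assumes one valid return per correct answer and one tool call per round; the data and function names are illustrative only.

```python
def accuracy(outcomes):
    """Fraction of instances answered correctly."""
    return sum(1 for correct, _ in outcomes if correct) / len(outcomes)


def cas(outcomes):
    """Valid returns per tool call, assuming one valid return per correct answer."""
    valid = sum(1 for correct, _ in outcomes if correct)
    calls = sum(n_calls for _, n_calls in outcomes)
    return valid / calls


# Each tuple is (answered_correctly, tool_calls_used) for one instance.
agent_a = [(True, 3)] * 4    # always correct after 3 calls
agent_b = [(True, 12)] * 4   # always correct after 12 calls

for name, outcomes in (("A", agent_a), ("B", agent_b)):
    print(f"agent {name}: accuracy={accuracy(outcomes):.2f}, CAS={cas(outcomes):.3f}")
```

Accuracy is 1.00 for both agents, while the CAS of agent A is four times that of agent B.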

Implementation

Implementation pitfalls
Binary correctness scoring ignores partial answers (Medium)

Computing CAS as valid returns / tool calls assumes binary scoring. A partially correct answer (e.g. 4/6 entities) counts as 0, which may understate CAS for models that give good but incomplete results.
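The effect of binary scoring on the 4/6 example can be shown side by side with a fractional-credit variant. The fractional variant is not part of CAS as described here; it is a hypothetical alternative included only for comparison, and the call count of 4 is made up.

```python
def binary_credit(found, expected):
    """Binary scoring: full credit only when every expected entity is returned."""
    return 1.0 if found >= expected else 0.0


def fractional_credit(found, expected):
    """Hypothetical partial-credit variant: share of expected entities returned."""
    return min(found, expected) / expected


found, expected = 4, 6   # the 4/6 entities example
tool_calls = 4           # illustrative call count for this single instance

binary_score = binary_credit(found, expected) / tool_calls
partial_score = fractional_credit(found, expected) / tool_calls

print(f"binary scoring:     {binary_score:.3f}")   # binary scoring:     0.000
print(f"fractional scoring: {partial_score:.3f}")  # fractional scoring: 0.167
```

Reporting both numbers side by side is one way to see how much of a low CAS comes from incomplete rather than wrong answers.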

Metric does not penalize high latency per call (Medium)

CAS counts the number of tool calls, not their duration. An agent making 3 very slow calls may score a higher CAS than one making 6 fast calls, despite a higher total response time.
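A small sketch of this blind spot, using made-up per-call latencies: the dictionary layout and the timings are assumptions, and reporting total latency alongside CAS is one possible mitigation, not something the metric prescribes.

```python
def summarize(run):
    """Return CAS together with total latency for a single run."""
    n_calls = len(run["call_latencies_s"])
    return {
        "cas": run["valid_returns"] / n_calls,
        "total_latency_s": sum(run["call_latencies_s"]),
    }


# Hypothetical runs: both return one valid answer.
slow_agent = {"valid_returns": 1, "call_latencies_s": [4.0, 4.0, 4.0]}  # 3 slow calls
fast_agent = {"valid_returns": 1, "call_latencies_s": [0.5] * 6}        # 6 fast calls

slow_summary = summarize(slow_agent)
fast_summary = summarize(fast_agent)
print("slow:", slow_summary)
print("fast:", fast_summary)
```

Here the slow agent has the higher CAS (1/3 versus 1/6) even though its total latency is 12 seconds against 3 seconds for the fast agent.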

Hard to compare across benchmarks with different tool definitions (Medium)

The granularity of a single tool call varies across systems: one UGS call may correspond to 6 calls in a sequential system. Without normalizing for tool granularity, CAS favors architectures with coarser tools.
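One possible way to normalize for granularity is to weight each tool by the number of primitive calls it is assumed to stand for. The sketch below is illustrative only: the tool names, the weight table, and the `normalized_cas` function are hypothetical, with the 6-to-1 ratio taken from the example above.

```python
def raw_cas(valid_returns, calls_by_tool):
    """Plain CAS: every tool call counts as 1 regardless of how much work it does."""
    total = sum(calls_by_tool.values())
    return valid_returns / total if total else 0.0


def normalized_cas(valid_returns, calls_by_tool, weights):
    """Weight each call by an assumed number of equivalent primitive calls."""
    weighted = sum(n * weights[tool] for tool, n in calls_by_tool.items())
    return valid_returns / weighted if weighted else 0.0


# Assumed equivalence: one coarse call does the work of 6 single queries.
weights = {"coarse_search": 6, "single_query": 1}

coarse_agent = {"coarse_search": 1}      # 1 coarse call
sequential_agent = {"single_query": 6}   # 6 fine-grained calls

print(raw_cas(1, coarse_agent), raw_cas(1, sequential_agent))
print(normalized_cas(1, coarse_agent, weights),
      normalized_cas(1, sequential_agent, weights))
```

With raw counts the coarse-tool agent scores six times higher; with the assumed weights both agents score the same. Choosing the weights is itself a judgment call, so any normalized score should state them explicitly.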