Evaluation

CAS

2026 · Research · Published

Key innovation

Metric evaluating search agents by the ratio of valid information returns to tool calls — combining accuracy and efficiency into a single measure.

How it works

CAS = Σ(valid information returns) / Σ(tool calls), with both sums taken across all benchmark instances. A higher CAS indicates an agent that uses tools efficiently, answering correctly with few calls. The metric penalizes both incorrect answers (which add zero valid returns to the numerator) and redundant calls (which grow the denominator).
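The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's reference implementation: the `Instance` record, its field names, and the choice to return 0.0 when no tool calls were made are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Instance:
    """One benchmark instance (hypothetical record layout)."""
    valid_returns: int  # 1 if the answer was judged correct, else 0 (binary scoring assumed)
    tool_calls: int     # number of tool calls the agent made on this instance


def cas(instances: Iterable[Instance]) -> float:
    """Sum of valid returns divided by sum of tool calls over all instances."""
    items = list(instances)
    total_returns = sum(i.valid_returns for i in items)
    total_calls = sum(i.tool_calls for i in items)
    if total_calls == 0:
        # Assumption: the zero-call case is not defined in the text; use 0.0 here.
        return 0.0
    return total_returns / total_calls


# Two correct answers at 3 calls each: 2 / 6
efficient = [Instance(1, 3), Instance(1, 3)]
# Two correct answers at 12 calls each: 2 / 24
wasteful = [Instance(1, 12), Instance(1, 12)]
# One wrong and one correct answer at 3 calls each: 1 / 6
half_wrong = [Instance(0, 3), Instance(1, 3)]

print(cas(efficient), cas(wasteful), cas(half_wrong))
```

In this toy data, the wasteful agent and the half-wrong agent both score below the efficient one, which is the behavior the definition describes: extra calls and wrong answers each lower the ratio.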

Problem solved

Existing agent metrics measure only accuracy — an agent answering correctly after 12 rounds is treated the same as one answering after 3. CAS introduces the efficiency dimension: how much useful information the agent returns per tool call.

Implementation

Implementation pitfalls

Binary correctness scoring ignores partial answers · Medium

CAS = valid returns / tool calls assumes binary scoring. A partially correct answer (e.g. 4/6 entities) counts as 0, which may understate CAS for models giving good but incomplete results.
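To make the gap concrete, the sketch below scores the 4/6 example both ways. The partial-credit scorer is a hypothetical alternative shown only for contrast; it is not part of CAS as described here, and the 3-call count is an invented figure.

```python
def binary_score(found: int, expected: int) -> float:
    """Binary scoring: full credit only when every expected entity is found."""
    return 1.0 if found == expected else 0.0


def partial_score(found: int, expected: int) -> float:
    """Hypothetical partial-credit scoring: fraction of expected entities found."""
    return found / expected


def per_instance_ratio(score: float, tool_calls: int) -> float:
    """Score divided by tool calls for a single instance."""
    return score / tool_calls


found, expected, calls = 4, 6, 3  # 4 of 6 entities, 3 tool calls (illustrative)

binary_ratio = per_instance_ratio(binary_score(found, expected), calls)    # 0 / 3
partial_ratio = per_instance_ratio(partial_score(found, expected), calls)  # (4 / 6) / 3

print(binary_ratio, partial_ratio)
```

Under binary scoring the instance contributes nothing to the numerator, even though two thirds of the entities were recovered.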

Metric does not penalize high latency per call · Medium

CAS counts the number of tool calls, not their duration. An agent making 3 very slow calls can score a higher CAS than one making 6 fast calls, despite a longer total response time.
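A small sketch of this effect, using invented latencies (10 s per slow call, 1 s per fast call). The `Run` record and its fields are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    """One agent run on one instance (hypothetical record layout)."""
    correct: bool
    call_latencies_s: List[float]  # wall-clock seconds per tool call (invented values)


def run_cas(run: Run) -> float:
    """Valid returns per tool call for a single run; latency is ignored."""
    calls = len(run.call_latencies_s)
    if calls == 0:
        return 0.0  # Assumption: undefined case mapped to 0.0
    return (1.0 if run.correct else 0.0) / calls


def total_latency_s(run: Run) -> float:
    """Total time spent in tool calls."""
    return sum(run.call_latencies_s)


slow_agent = Run(correct=True, call_latencies_s=[10.0, 10.0, 10.0])  # 3 slow calls
fast_agent = Run(correct=True, call_latencies_s=[1.0] * 6)           # 6 fast calls

# The slow agent has the higher ratio (1/3 vs 1/6) but takes 30 s instead of 6 s.
print(run_cas(slow_agent), total_latency_s(slow_agent))
print(run_cas(fast_agent), total_latency_s(fast_agent))
```

Reporting total latency alongside CAS, as the two `print` lines do, is one simple way to keep this trade-off visible.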

Hard to compare across benchmarks with different tool definitions · Medium

The granularity of one tool call varies across systems — one UGS call may correspond to 6 calls in a sequential system. CAS without tool granularity normalization favors architectures with coarser tools.
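The sketch below shows how the raw ratio favors a coarse tool, and one possible way to normalize for granularity. The normalization (multiplying calls by an estimated number of primitive operations per call) is an assumption made for this example, not a method given in the source; the factor of 6 mirrors the illustration in the text.

```python
def raw_cas(valid_returns: int, tool_calls: int) -> float:
    """Valid returns per tool call, with no granularity adjustment."""
    return valid_returns / tool_calls if tool_calls else 0.0


def normalized_cas(valid_returns: int, tool_calls: int, ops_per_call: int) -> float:
    """Hypothetical normalization: count each call as `ops_per_call` primitive operations."""
    return raw_cas(valid_returns, tool_calls * ops_per_call)


# Coarse-tool system: 1 call that bundles 6 primitive operations, answer correct.
coarse_raw = raw_cas(1, 1)
coarse_norm = normalized_cas(1, 1, ops_per_call=6)

# Sequential system: 6 calls of 1 primitive operation each, answer correct.
seq_raw = raw_cas(1, 6)
seq_norm = normalized_cas(1, 6, ops_per_call=1)

print(coarse_raw, seq_raw)    # raw ratios differ by a factor of 6
print(coarse_norm, seq_norm)  # normalized ratios match
```

Estimating `ops_per_call` is itself a judgment call, so any normalized figure would need to state how that factor was chosen.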

Evolution

Original paper · 2026 · Guankai Li

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

Guankai Li, Jiabin Chen, Yi Xu, Xichen Zhang, Yuan Lu

Sources

HyperEyes arXiv paper

Paper