Evaluation

IMEB

2026 · Research · Published

Key innovation

The first multimodal agent benchmark to jointly evaluate accuracy and inference efficiency (tool cost), filling the gap left by benchmarks that measure accuracy alone.

How it works

Each IMEB instance pairs an image with a question that requires identifying multiple entities at once (e.g. 6 people, multiple products, or multiple scientific objects). Three quantities are evaluated: (1) accuracy, i.e. whether the answer is correct; (2) the number of tool-call rounds; (3) CAS = correct information returns / number of tool calls. HyperEyes-30B achieves a 64.0% advantage over the second-best model on IMEB.
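The three quantities above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's official scorer: the `EpisodeLog` record and its field names are hypothetical, and pooling CAS over all tool calls (rather than averaging it per instance) is an assumption, since the aggregation is not specified here.

```python
from dataclasses import dataclass


@dataclass
class EpisodeLog:
    """Hypothetical per-instance log; field names are illustrative only."""
    answer_correct: bool   # final answer judged correct
    tool_rounds: int       # number of tool-call rounds used
    tool_calls: int        # total tool calls issued across all rounds
    correct_returns: int   # tool calls that returned correct information


def summarize(episodes):
    """Return accuracy, mean tool-call rounds, and pooled CAS."""
    n = len(episodes)
    accuracy = sum(e.answer_correct for e in episodes) / n
    mean_rounds = sum(e.tool_rounds for e in episodes) / n
    total_calls = sum(e.tool_calls for e in episodes)
    total_correct = sum(e.correct_returns for e in episodes)
    # Assumption: CAS is pooled over all calls; guard against zero calls.
    cas = total_correct / total_calls if total_calls else 0.0
    return {"accuracy": accuracy, "mean_rounds": mean_rounds, "cas": cas}
```

Two agents with the same accuracy can then be told apart by `mean_rounds` and `cas`, which is the distinction an accuracy-only score cannot make.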

Problem solved

Existing multimodal agent benchmarks reward only accuracy and ignore inference cost: an agent that answers correctly after 12 tool-call rounds is scored identically to one that answers after 3. IMEB introduces efficiency as a measurable quality dimension.

Implementation

Implementation pitfalls

Small sample size (300) limits statistical significance · Medium

The benchmark consists of 300 instances, so differences of a few percentage points between models may not be statistically significant. Bootstrap confidence intervals are recommended for comparisons.
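For scale: with 300 instances and an accuracy near 50%, the standard error of a single accuracy estimate is about sqrt(0.25 / 300) ≈ 2.9 points, so a 95% interval is roughly ±5.7 points. A paired percentile bootstrap over instances is one simple way to attach an interval to the accuracy difference between two models. The sketch below assumes both models were scored on the same instances in the same order; the function name and defaults are illustrative choices, not part of IMEB.

```python
import random


def bootstrap_diff_ci(correct_a, correct_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B), paired by instance.

    correct_a, correct_b: equal-length sequences of 0/1 (or bool) outcomes,
    aligned so that index i refers to the same benchmark instance.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("outcome lists must be aligned and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample instance indices with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the returned interval contains 0, the observed gap between the two models is not distinguishable from resampling noise at the chosen level.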

No standardized entity count per instance · Medium

IMEB instances vary in the number of entities to identify, so aggregate scores mix easier and harder cases. A model that handles few-entity instances well may score higher not because it is more parallel, but because easier instances contribute much of its score.
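One way to check for this effect is to report accuracy separately for each entity count instead of a single aggregate. The helper below is a hypothetical sketch that assumes each result is available as an `(entity_count, answer_correct)` pair; IMEB itself may not expose entity counts in this form.

```python
from collections import defaultdict


def accuracy_by_entity_count(records):
    """Group (entity_count, answer_correct) pairs and return accuracy per group."""
    totals = defaultdict(lambda: [0, 0])  # entity_count -> [num_correct, num_total]
    for entity_count, correct in records:
        totals[entity_count][0] += int(correct)
        totals[entity_count][1] += 1
    # Sort by entity count so trends across difficulty are easy to read.
    return {k: c / t for k, (c, t) in sorted(totals.items())}


records = [(2, True), (2, False), (6, True), (6, True)]
per_count = accuracy_by_entity_count(records)  # {2: 0.5, 6: 1.0}
```

Comparing models within the same entity-count stratum separates parallelism gains from differences in instance difficulty.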

CAS is sensitive to the definition of valid information return · Medium

The CAS metric assumes binary correctness of returned information. In practice, answers may be partially correct, which requires clear grading rules; without standardization, cross-implementation comparisons are difficult.
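The sensitivity can be illustrated with a small sketch. It assumes, purely for illustration, that each tool call receives a grade in [0, 1] and that an implementation either keeps the partial credit or binarizes it at a threshold; neither the grading scale nor the threshold comes from the benchmark.

```python
def cas(grades, threshold=None):
    """CAS over per-tool-call grades in [0, 1].

    threshold=None keeps partial credit; otherwise a call counts as a
    correct return only if its grade is >= threshold (binary grading).
    """
    if not grades:
        return 0.0
    if threshold is not None:
        grades = [1.0 if g >= threshold else 0.0 for g in grades]
    return sum(grades) / len(grades)


grades = [1.0, 0.5, 0.5, 0.0]          # hypothetical grades for four tool calls
partial = cas(grades)                  # 0.5
lenient = cas(grades, threshold=0.5)   # 0.75
strict = cas(grades, threshold=1.0)    # 0.25
```

The same four tool calls yield a CAS of 0.25, 0.5, or 0.75 depending only on the grading rule, which is why scores from different implementations are hard to compare without a shared definition.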

Evolution

Original paper · 2026 · Guankai Li

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

Guankai Li, Jiabin Chen, Yi Xu, Xichen Zhang, Yuan Lu

Sources

HyperEyes arXiv paper

Paper