Each IMEB instance is an image paired with a question that requires identifying multiple entities simultaneously (e.g. 6 people, multiple products, multiple scientific objects). Three quantities are evaluated: (1) accuracy, i.e. whether the answer is correct; (2) the number of tool-call rounds; (3) CAS, defined as correct information returns / number of tool calls. HyperEyes-30B achieves a 64.0% advantage over the second-best model on IMEB.
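The three quantities can be sketched for a single episode as follows. This is only an illustration under stated assumptions: the `Episode` record and its field names are hypothetical, "correct information returns" is read as the number of tool calls whose returned information was judged correct, and returning 0.0 when no tools were called is a convention chosen here, not one taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical per-instance log; field names are assumptions."""
    answer_correct: bool   # (1) was the final answer correct?
    rounds: int            # (2) number of tool-call rounds
    tool_calls: int        # total tool calls issued across all rounds
    correct_returns: int   # tool calls whose returned information was judged correct


def cas(ep: Episode) -> float:
    """CAS = correct information returns / number of tool calls."""
    if ep.tool_calls == 0:
        # Assumed convention: no tool calls means no useful returns.
        return 0.0
    return ep.correct_returns / ep.tool_calls


def accuracy(episodes: list[Episode]) -> float:
    """Fraction of episodes with a correct final answer."""
    return sum(ep.answer_correct for ep in episodes) / len(episodes)


def mean_rounds(episodes: list[Episode]) -> float:
    """Average number of tool-call rounds per episode."""
    return sum(ep.rounds for ep in episodes) / len(episodes)
```

Whether CAS is averaged per episode or pooled over all tool calls in the benchmark is not stated in the text above; the sketch computes it per episode and leaves the aggregation open.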
Multimodal agent benchmarks typically reward only accuracy and ignore inference cost. An agent that answers correctly after 12 tool-call rounds is treated identically to one that answers correctly after 3. IMEB introduces efficiency as a measurable quality dimension.
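The benchmark consists of 300 instances, so differences between models of a few percentage points may not be statistically significant: at an accuracy near 50%, the standard error of a single model's accuracy is already about 2.9 percentage points (sqrt(0.25 / 300)). Bootstrap confidence intervals are recommended for comparisons.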
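One way to obtain such an interval is a paired percentile bootstrap over instances, sketched below. The function name, the default of 2000 resamples, and the fixed seed are illustrative assumptions. The inputs are per-instance 0/1 correctness lists for two models, aligned on the same instances.

```python
import random


def bootstrap_diff_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B).

    correct_a, correct_b: equal-length sequences of 0/1 (or bool),
    where index i refers to the same benchmark instance in both.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same instances")
    n = len(correct_a)
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample instance indices with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

If the resulting interval contains 0, the observed gap between the two models is not distinguishable from resampling noise at the chosen level. Resampling instances jointly for both models, rather than independently, keeps the comparison paired.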
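IMEB instances vary in the number of entities to identify, so an aggregate score can hide where a model's gains come from. A model that does well on few-entity instances may score higher not because it is more parallel, but because it wins on the easier instances. Reporting results per entity count helps separate the two effects.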
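A simple check for this confound is to break accuracy down by entity count. The sketch below assumes each result is available as a hypothetical `(num_entities, is_correct)` pair; the benchmark's actual data format may differ.

```python
from collections import defaultdict


def accuracy_by_entity_count(results):
    """Group results by number of entities and compute accuracy per group.

    results: iterable of (num_entities, is_correct) pairs (assumed format).
    Returns a dict mapping entity count to accuracy, ordered by entity count.
    """
    totals = defaultdict(lambda: [0, 0])  # entity count -> [num correct, num total]
    for num_entities, is_correct in results:
        totals[num_entities][0] += int(is_correct)
        totals[num_entities][1] += 1
    return {n: c / t for n, (c, t) in sorted(totals.items())}
```

Comparing two models bucket by bucket shows whether an overall lead persists on many-entity instances or is concentrated in the few-entity ones. With only 300 instances in total, individual buckets can be small, so per-bucket differences deserve the same caution as the aggregate.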
The CAS metric assumes binary correctness of returned information. In practice, answers may be partially correct, which requires clear grading rules; without standardized rules, cross-implementation comparisons are difficult.