Researchers from XiaoHongShu and the University of Cambridge published HyperEyes, a multimodal search agent that processes multiple entities in parallel rather than sequentially, on 8 May 2026. Across six benchmarks, the 30B version outperforms the best comparable open-source agent by 9.9% while using 5.3× fewer tool-call rounds on average.
Key takeaways
- HyperEyes-30B: +9.9% accuracy vs best open-source agent, 5.3× fewer tool-call rounds
- New IMEB benchmark (300 samples): HyperEyes outperforms the second-best model by 64.0%
- Core innovation: UGS (Unified Grounded Search), in which a single tool call handles N entities concurrently
- Two-stage training: TRACE (trajectory-level rewards) + OPD (token-level correction from 235B teacher)
- Code and data publicly available on GitHub: github.com/DeepExperienceAI/HyperEyes
The sequential search bottleneck
Existing multimodal AI agents share a structural flaw: when a query involves multiple independent entities (say, six people in a photograph), the agent calls tools one entity at a time. For complex questions, this produces cascading rounds of latency, growing token consumption, and compounding error risk. The HyperEyes authors term this the "N-run bottleneck."
A head-to-head comparison with DeepEyes-V2 (an earlier sequential agent from the same group) makes the contrast concrete. Given a question requiring cross-referencing six individuals from a single image, DeepEyes-V2 required 12 tool-call rounds and still returned the wrong answer. HyperEyes issued one unified grounded query covering all six entities simultaneously and answered correctly in 3 rounds.
One action, many entities: UGS
The architectural centrepiece is the Unified Grounded Search (UGS) action space. Instead of a sequence of individual tool calls, HyperEyes emits a single atomic action that combines visual grounding with retrieval across all target entities in one step. This is the "search wider" strategy the paper advocates over the conventional "search deeper" (more sequential rounds) approach. The practical payoff: fewer tokens spent on coordination, fewer opportunities for cascading errors, and lower inference cost per correct answer.
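To make the contrast concrete, here is a minimal Python sketch, not the authors' implementation. The names `ground_and_search`, `sequential_agent`, and `unified_grounded_search` are hypothetical, and the stub stands in for a real grounding-plus-retrieval tool. It only illustrates how one action covering N entities collapses N tool-call rounds into a single round.

```python
from concurrent.futures import ThreadPoolExecutor


def ground_and_search(entity: str) -> str:
    """Hypothetical stub for one grounding + retrieval call on one entity."""
    return f"evidence for {entity}"


def sequential_agent(entities: list[str]) -> tuple[dict[str, str], int]:
    """One tool-call round per entity: N entities cost N rounds."""
    results: dict[str, str] = {}
    rounds = 0
    for entity in entities:
        results[entity] = ground_and_search(entity)
        rounds += 1
    return results, rounds


def unified_grounded_search(entities: list[str]) -> tuple[dict[str, str], int]:
    """One action covering all entities: the lookups run concurrently
    and the whole batch is counted as a single tool-call round."""
    with ThreadPoolExecutor(max_workers=max(1, len(entities))) as pool:
        evidence = list(pool.map(ground_and_search, entities))
    return dict(zip(entities, evidence)), 1


# Six people in one photograph, as in the example above.
people = [f"person_{i}" for i in range(1, 7)]
seq_results, seq_rounds = sequential_agent(people)
ugs_results, ugs_rounds = unified_grounded_search(people)
print(seq_rounds, ugs_rounds)
```

Both paths return the same evidence; only the number of rounds the agent has to coordinate differs.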
Training: TRACE + OPD
Having a parallel action space is only half the solution: the model must also learn to use it economically. That is the role of the Dual-Grained Efficiency-Aware Reinforcement Learning framework.
TRACE (Tool-use Reference-Adaptive Cost Efficiency) operates at the trajectory level. After each rollout, the reward for a correct answer is penalised proportionally to the number of tool-call steps taken. Crucially, the efficiency threshold is tightened monotonically across training epochs, like raising a high-jump bar, so a model that earns a reward at five steps must eventually achieve the same in three.
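The article does not give the exact reward formula, so the following is only an illustrative sketch of the idea. The schedule in `step_budget`, the penalty coefficient, and the choice to penalise only steps beyond the current budget are all assumptions made for this example.

```python
def step_budget(epoch: int, start: int = 5, floor: int = 3, decay_every: int = 2) -> int:
    """Assumed schedule: the allowed step count never increases with the epoch."""
    return max(floor, start - epoch // decay_every)


def trace_reward(correct: bool, steps: int, epoch: int, penalty: float = 0.1) -> float:
    """Assumed reward shape: a wrong answer earns 0; a correct answer earns 1
    minus a penalty proportional to the steps taken beyond the current budget."""
    if not correct:
        return 0.0
    excess = max(0, steps - step_budget(epoch))
    return max(0.0, 1.0 - penalty * excess)


# The same five-step correct rollout is rewarded less once the bar has moved.
early = trace_reward(correct=True, steps=5, epoch=0)
late = trace_reward(correct=True, steps=5, epoch=4)
print(early, late)
```

With these assumed defaults, a five-step rollout keeps its full reward at epoch 0 but loses part of it by epoch 4, when the budget has dropped to three steps.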
OPD (On-Policy Distillation) complements TRACE at the token level. It activates exclusively on failed trajectories, injecting a dense corrective signal from a 235B teacher model via KL divergence. This design preserves self-discovered efficient behaviours on successful rollouts while providing targeted correction on failures, addressing the credit-assignment problem inherent in sparse outcome rewards.
Algorithm pseudocode
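What follows is a minimal sketch of how the two signals could be combined for one rollout, written as runnable Python rather than as the paper's own pseudocode. The function names, the REINFORCE-style policy term, the KL direction (student relative to teacher), and the weight `beta` are assumptions for illustration; the trajectory-level reward is taken as an input.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary.
    Assumes q is non-zero wherever p is non-zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def opd_penalty(student_tokens: list[list[float]],
                teacher_tokens: list[list[float]],
                succeeded: bool) -> float:
    """Mean per-token KL between student and teacher distributions.
    Gated: successful rollouts receive no distillation signal."""
    if succeeded or not student_tokens:
        return 0.0
    kls = [kl_divergence(s, t) for s, t in zip(student_tokens, teacher_tokens)]
    return sum(kls) / len(kls)


def joint_loss(traj_log_prob: float, reward: float,
               student_tokens: list[list[float]],
               teacher_tokens: list[list[float]],
               succeeded: bool, beta: float = 1.0) -> float:
    """Assumed joint objective: trajectory-level policy term plus
    token-level distillation term on failed rollouts only."""
    policy_term = -reward * traj_log_prob
    return policy_term + beta * opd_penalty(student_tokens, teacher_tokens, succeeded)


# Toy two-token vocabulary, one generated token per rollout.
student = [[0.5, 0.5]]
teacher = [[0.9, 0.1]]
loss_success = joint_loss(-2.0, 0.8, student, teacher, succeeded=True)
loss_failure = joint_loss(-2.0, 0.0, student, teacher, succeeded=False)
print(loss_success, loss_failure)
```

The gate in `opd_penalty` is the point of the design described above: a successful rollout is shaped only by its efficiency-aware reward, while a failed rollout with zero reward still receives a dense per-token signal from the teacher.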
IMEB: a new benchmark for multimodal efficiency
Existing multimodal benchmarks measure accuracy as the sole metric, ignoring inference cost. To fill this gap the authors introduced IMEB (Image Multi-Entity Benchmark): 300 human-curated instances spanning humanities and science, each requiring simultaneous identification of multiple image entities.
Alongside IMEB, they propose CAS (Companion Assessment Standard), a metric defined as valid information returns per unit compute (roughly, "useful information per tool-call step"). On this metric, HyperEyes-30B achieves 7.6× better information efficiency than sequential models.
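Read literally, "useful information per tool-call step" is a simple ratio. The sketch below assumes that reading; the function name `info_per_step` and the counts are invented for illustration and are not figures from the paper or the official CAS definition.

```python
def info_per_step(valid_items: int, tool_steps: int) -> float:
    """Assumed reading of the metric: valid information items per tool-call step."""
    if tool_steps <= 0:
        raise ValueError("tool_steps must be positive")
    return valid_items / tool_steps


# Illustrative counts only: both agents recover six valid items.
sequential = info_per_step(valid_items=6, tool_steps=12)
parallel = info_per_step(valid_items=6, tool_steps=3)
print(sequential, parallel, parallel / sequential)
```

Under this reading, an agent improves its score either by returning more valid information or by spending fewer steps to get it, which is why a parallel action space helps directly.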
Results across six benchmarks
HyperEyes-30B achieves state-of-the-art among open-source models on standard multimodal search benchmarks. The advantage is most visible on multi-entity efficiency โ see the table below.
| Benchmark | HyperEyes-30B | HyperEyes-235B | Avg. tool rounds |
|---|---|---|---|
| MMSearch | 64.1 | - | 1.9 |
| FVQA | 58.0 | - | 1.9 |
| IMEB (multi-entity) | 17.1 | 32.2 | 1.9 |

| K (permutation) | IMEB score | Deviation from mean |
|---|---|---|
| K=1…10 | stable | < 1 pt |
| Mean | 17.1 | - |
Technical diagram
[Diagram: HyperEyes end-to-end: multi-entity input → UGS → parallel tool calls → answer, with both training levels (TRACE + OPD) feeding the joint loss.]
| Metric | DeepEyes-V2 | HyperEyes-30B |
|---|---|---|
| Avg. tool rounds | 7–10 | 1.9 |
| Multi-entity task accuracy | low | higher |
| CAS (info per step) | 1.0× | 7.6× |
| Tool-call mode | sequential | parallel (UGS) |
| Cumulative step-by-step error | yes | significantly reduced |
Why it matters
Most multimodal agent research optimises for accuracy alone. HyperEyes reframes the objective: accuracy matters, but without inference cost control an agent is unsuitable for production deployment. Every tool call carries latency, API cost, and cumulative error risk.
XiaoHongShu, a Chinese social-commerce platform with over 300 million monthly users, has a direct business case for making visual search faster and cheaper at scale. This explains why efficiency was elevated to a "first-class training objective" rather than an afterthought. The UGS approach is broadly applicable wherever user queries involve multiple concurrent entities: visual product search, complex document analysis, biographical look-ups, multi-entity medical reasoning.
Independently of the model, IMEB and the CAS metric propose a new evaluation standard for search agents, one that fills the blind spot of benchmarks that reward accuracy while ignoring its cost.
What's next?
- Code and training data are publicly available (github.com/DeepExperienceAI/HyperEyes), enabling open-source replication and adaptation
- IMEB is open for comparison; future work may extend it to higher entity counts, video inputs, or multi-page documents
- The authors note integration with visual search systems such as XiaoHongShu as a natural next step, though no production deployment timeline has been announced
