HyperEyes: multimodal agent that searches in parallel, not sequentially

On 8 May 2026, researchers from XiaoHongShu and the University of Cambridge published HyperEyes, a multimodal search agent that processes multiple entities in parallel rather than sequentially. Across six benchmarks, the 30B version outperforms the best comparable open-source agent by 9.9% while using 5.3× fewer tool-call rounds on average.

Key takeaways

  • HyperEyes-30B: +9.9% accuracy vs best open-source agent, 5.3× fewer tool-call rounds
  • New IMEB benchmark (300 samples): HyperEyes outperforms the second-best model by 64.0%
  • Core innovation: UGS (Unified Grounded Search), in which a single tool call handles N entities concurrently
  • Two-stage training: TRACE (trajectory-level rewards) + OPD (token-level correction from 235B teacher)
  • Code and data publicly available on GitHub: github.com/DeepExperienceAI/HyperEyes

The sequential search bottleneck

Existing multimodal AI agents share a structural flaw: when a query involves multiple independent entities (say, six people in a photograph), the agent calls tools one entity at a time. For complex questions, this produces cascading rounds of latency, growing token consumption, and compounding error risk. The HyperEyes authors term this the "N-run bottleneck."

A head-to-head comparison with DeepEyes-V2 (an earlier sequential agent from the same group) makes the contrast concrete. Given a question requiring cross-referencing six individuals from a single image, DeepEyes-V2 required 12 tool-call rounds and still returned the wrong answer. HyperEyes issued one unified grounded query covering all six entities simultaneously and answered correctly in 3 rounds.

One action, many entities: UGS

The architectural centrepiece is the Unified Grounded Search (UGS) action space. Instead of a sequence of individual tool calls, HyperEyes emits a single atomic action that combines visual grounding with retrieval across all target entities in one step. This is the "search wider" strategy the paper advocates over the conventional "search deeper" (more sequential rounds) approach. The practical payoff: fewer tokens spent on coordination, fewer opportunities for cascading errors, and lower inference cost per correct answer.
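To make the contrast between "search deeper" and "search wider" concrete, here is a minimal sketch of the two calling patterns. It is not the authors' implementation: the `Entity` schema, the `lookup` stand-in tool, and both function names are hypothetical, and the only thing the sketch models is how many agent rounds each pattern spends.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A grounded image region to look up (hypothetical schema)."""
    label: str
    box: tuple  # (x0, y0, x1, y1) in pixels


def lookup(entity):
    """Stand-in for a real retrieval tool; returns a canned string."""
    return f"result for {entity.label}"


def sequential_search(entities):
    """Baseline pattern: one tool-call round per entity."""
    results, rounds = {}, 0
    for entity in entities:
        results[entity.label] = lookup(entity)
        rounds += 1
    return results, rounds


def unified_grounded_search(entities):
    """One action covering all entities; the lookups fan out concurrently."""
    with ThreadPoolExecutor(max_workers=max(1, len(entities))) as pool:
        found = list(pool.map(lookup, entities))  # map preserves input order
    results = {e.label: r for e, r in zip(entities, found)}
    return results, 1  # the agent spent a single round


people = [Entity(f"person_{i}", (i * 10, 0, i * 10 + 8, 20)) for i in range(6)]
seq_results, seq_rounds = sequential_search(people)
ugs_results, ugs_rounds = unified_grounded_search(people)
print(seq_rounds, ugs_rounds)
```

Both patterns return the same information in this toy setting; the difference is that the sequential loop charges the agent one round per entity, while the unified action charges one round in total.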

Training: TRACE + OPD

Having a parallel action space is only half the solution: the model must also learn to use it economically. That is the role of the Dual-Grained Efficiency-Aware Reinforcement Learning framework.

TRACE (Tool-use Reference-Adaptive Cost Efficiency) operates at the trajectory level. After each rollout, the reward for a correct answer is penalised proportionally to the number of tool-call steps taken. Crucially, the efficiency threshold is tightened monotonically across training epochs, like raising a high-jump bar, so a model that earns a reward at five steps must eventually achieve the same in three.
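The trajectory-level reward and the tightening reference can be sketched in a few lines. This is a hedged illustration rather than the paper's code: the constants `lam`, `r_pos`, and `r_neg` are made-up values, both function names are hypothetical, and the reference update assumes that the bar drops to the fewest tool turns seen among correct rollouts. How the reference feeds back into the reward is not modelled here, since the article does not spell it out.

```python
def trace_reward(correct, turns, lam=0.125, r_pos=1.0, r_neg=-1.0):
    """Trajectory-level reward: correct answers pay a per-turn penalty.

    lam, r_pos and r_neg are illustrative constants, not values from the paper.
    """
    if correct:
        return r_pos - lam * turns
    return r_neg


def tighten_reference(ref, rollouts):
    """Lower the reference turn count; it never moves back up.

    rollouts is a list of (correct, turns) pairs. Assumption: only
    correct rollouts are allowed to set a new, tighter reference.
    """
    successful = [turns for correct, turns in rollouts if correct]
    if not successful:
        return ref
    return min(ref, min(successful))


batch = [(True, 5), (True, 3), (False, 1)]
rewards = [trace_reward(ok, t) for ok, t in batch]
new_ref = tighten_reference(5, batch)
print(rewards, new_ref)
```

Because the update takes a minimum, a later batch of slower rollouts cannot loosen the reference, which matches the monotone tightening described above.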

OPD (On-Policy Distillation) complements TRACE at the token level. It activates exclusively on failed trajectories, injecting a dense corrective signal from a 235B teacher model via KL divergence. This design preserves self-discovered efficient behaviours on successful rollouts while providing targeted correction on failures, addressing the credit-assignment problem inherent in sparse outcome rewards.
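The failure-only gating can be illustrated with a toy calculation over small discrete distributions. This is a sketch under stated assumptions, not the authors' training code: `kl_divergence` and `opd_loss` are hypothetical helpers, each token position is represented by a short probability list instead of a real vocabulary, the teacher is assumed to assign non-zero probability everywhere, and averaging the per-token terms is an illustrative choice.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    Assumes q is strictly positive wherever p is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def opd_loss(student_steps, teacher_steps, correct):
    """Mean per-token KL(student || teacher), applied only to failed rollouts."""
    if correct or not student_steps:
        return 0.0  # successful rollouts receive no distillation signal
    per_token = [kl_divergence(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


student = [[0.5, 0.5], [0.9, 0.1]]  # two token positions, two-word vocabulary
teacher = [[0.5, 0.5], [0.5, 0.5]]

loss_on_failure = opd_loss(student, teacher, correct=False)
loss_on_success = opd_loss(student, teacher, correct=True)
print(loss_on_failure, loss_on_success)
```

The gate is the whole point of the design: the same student and teacher distributions produce a positive loss on a failed rollout and exactly zero on a successful one, so behaviour the student discovered on its own is left untouched.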

Algorithm pseudocode

Algorithm 1: Dual-Grained Efficiency-Aware RL (TRACE + OPD)

Require: Prompts batch P; Student π_s; Teacher π_teacher; Initial reference l_t

for epoch e = 0, 1, …, E − 1 do
  for prompt q ∈ P do

    // Step 1: Rollout Sampling
    T ← τ^k(q)                          ▷ Sample rollouts from student policy
    for trajectory τ ∈ T do
      acc ← Evaluate(τ);  t_s ← TurnCount(τ)    ▷ acc ∈ {0,1}; t_s counts tool turns

    // Step 2: TRACE Reward (Trajectory-level)
    if acc == 1 then
      R_trace ← R^+ − λ_t · t_s           ▷ Reward efficiency, squeeze redundancy
    else
      R_trace ← R^−                        ▷ Penalize redundant searches
    end if

    // Step 3: OPD Distillation (Token-level)
    if acc == 0 then
      L_OPD ← KL(π_s ‖ π_teacher)          ▷ Dense correction ONLY on failures
    else
                                          ▷ Protect self-explored efficient behaviors
    end if

  end for

  // Step 4: Joint Optimization & Macro-level Update
  L_final ← L_policy(R_trace) + λ_kl · L_OPD
  θ ← θ − η∇_θ L_final                  ▷ Update student policy
  l_t ← min(l_t, min(T_tol))            ▷ Tighten reference (like a high jump bar)

end for

IMEB: a new benchmark for multimodal efficiency

Existing multimodal benchmarks measure accuracy as the sole metric, ignoring inference cost. To fill this gap the authors introduced IMEB (Image Multi-Entity Benchmark): 300 human-curated instances spanning humanities and science, each requiring simultaneous identification of multiple image entities.

Alongside IMEB they propose CAS (Companion Assessment Standard), a metric defined as valid information returns per unit compute: roughly, "useful information per tool-call step." On this metric, HyperEyes-30B achieves 7.6× better information efficiency than sequential models.
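Read literally, "useful information per tool-call step" is a simple ratio. The sketch below works through that reading with made-up numbers; the function name `cas` and the counts are hypothetical, and the paper's exact formula (for instance, how "valid" returns are judged or how compute is measured) may differ.

```python
def cas(valid_returns, tool_rounds):
    """Valid information returns per tool-call round (illustrative reading)."""
    if tool_rounds <= 0:
        raise ValueError("tool_rounds must be positive")
    return valid_returns / tool_rounds


# Hypothetical trajectories: both recover six valid facts,
# but the sequential agent needs four times as many rounds.
sequential_cas = cas(valid_returns=6, tool_rounds=12)
parallel_cas = cas(valid_returns=6, tool_rounds=3)
relative_gain = parallel_cas / sequential_cas
print(sequential_cas, parallel_cas, relative_gain)
```

These figures are only an arithmetic example and are unrelated to the 7.6× result reported for HyperEyes-30B.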

Results across six benchmarks

HyperEyes-30B achieves state-of-the-art among open-source models on standard multimodal search benchmarks. The advantage is most visible on multi-entity efficiency โ€” see the table below.

Benchmark             HyperEyes-30B   HyperEyes-235B   Avg. tool rounds
MMSearch              64.1            -                1.9
FVQA                  58.0            -                1.9
IMEB (multi-entity)   17.1            32.2             1.9

K (permutation)   IMEB score   Deviation from mean
K=1…10            stable       < 1 pt
Mean              17.1         -

Technical diagram

The diagram below shows HyperEyes end-to-end: multi-entity input → UGS → parallel tool calls → answer, with both training levels (TRACE + OPD) feeding the joint loss.

Metric                          DeepEyes-V2   HyperEyes-30B
Avg. tool rounds                7–10          1.9
Multi-entity task accuracy      low           higher
CAS (info per step)             1.0×          7.6×
Tool-call mode                  sequential    parallel (UGS)
Cumulative step-by-step error   yes           significantly reduced

Why it matters

Most multimodal agent research optimises for accuracy alone. HyperEyes reframes the objective: accuracy matters, but without inference cost control an agent is unsuitable for production deployment. Every tool call carries latency, API cost, and cumulative error risk.

XiaoHongShu, a Chinese social-commerce platform with over 300 million monthly users, has a direct business case for making visual search faster and cheaper at scale. This explains why efficiency was elevated to a "first-class training objective" rather than an afterthought. The UGS approach is broadly applicable wherever user queries involve multiple concurrent entities: visual product search, complex document analysis, biographical look-ups, multi-entity medical reasoning.

Independently of the model, IMEB and the CAS metric propose a new evaluation standard for search agents, one that fills the blind spot of benchmarks that reward accuracy while ignoring its cost.

What's next?

  • Code and training data are publicly available (github.com/DeepExperienceAI/HyperEyes), enabling open-source replication and adaptation
  • IMEB is open for comparison; future work may extend it to higher entity counts, video inputs, or multi-page documents
  • The authors note integration with visual search systems such as XiaoHongShu as a natural next step, though no production deployment timeline has been announced
