HyperEyes

7B / 30B
Parallel multimodal search agent from Xiaohongshu and the University of Cambridge that processes multiple entities concurrently rather than sequentially. HyperEyes-30B is 9.9% more accurate than the strongest comparable open-source agent while using 5.3× fewer tool-call rounds.
🔬 Research · 🔬 Research only · ⚖ Open source · Multimodal · Tool-using model · Agent model · Vision
Parameters: 7B / 30B
Release date: 8 May 2026
Access: Download
Deployment: 💻 Local, ☁ Cloud

Overview

HyperEyes is a parallel multimodal search agent developed by Xiaohongshu (rednote-hilab) and the University of Cambridge. It addresses the fundamental bottleneck of existing multimodal agents: the "N-run bottleneck", whereby tool calls are issued one entity at a time. For queries involving multiple independent entities — e.g., six people in a photograph — this produces cascading interaction rounds, growing token consumption, and compounding error risk.

The core innovation is UGS (Unified Grounded Search) — an atomic action that fuses visual grounding with multi-entity retrieval in a single step. Rather than "searching deeper" (more sequential rounds), HyperEyes "searches wider": one call covers all target entities concurrently.
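
The difference between "searching deeper" and "searching wider" can be illustrated with a minimal Python sketch. It is not the HyperEyes implementation or API: `search_entity`, `sequential_search`, and `parallel_search` are hypothetical names, the retrieval tool is a stub, and the thread pool only stands in for whatever batching the real UGS action performs. The sketch counts interaction rounds, assuming one round per tool call in the sequential case and a single round for the batched call.

```python
from concurrent.futures import ThreadPoolExecutor


def search_entity(region):
    """Hypothetical stand-in for a retrieval tool applied to one grounded region."""
    return f"result for {region}"


def sequential_search(regions):
    """One tool call per entity: N entities cost N interaction rounds."""
    results = {}
    rounds = 0
    for region in regions:
        results[region] = search_entity(region)
        rounds += 1
    return results, rounds


def parallel_search(regions):
    """One batched call covering all entities: counted as a single round (assumption)."""
    with ThreadPoolExecutor(max_workers=max(1, len(regions))) as pool:
        found = list(pool.map(search_entity, regions))
    return dict(zip(regions, found)), 1


if __name__ == "__main__":
    # Six people in a photograph, as in the example from the overview.
    people = [f"person_{i}" for i in range(6)]
    _, seq_rounds = sequential_search(people)
    _, par_rounds = parallel_search(people)
    print(seq_rounds, par_rounds)
```

Both functions return the same results here; only the number of rounds differs, which is the quantity the agent is designed to reduce.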

Training proceeds in two stages: (1) a cold start via a Parallel-Amenable Data Synthesis Pipeline with Progressive Rejection Sampling, and (2) Dual-Grained Efficiency-Aware Reinforcement Learning. The second stage combines two signals. TRACE (Tool-use Reference-Adaptive Cost Efficiency) operates at the trajectory level: it rewards correct answers but applies a penalty proportional to the number of tool-call steps, and the efficiency threshold is tightened monotonically across epochs. OPD (On-Policy Distillation) operates at the token level: it activates only on failed rollouts and injects dense corrective signals from a 235B teacher model via KL divergence.
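
The two reinforcement signals can be sketched roughly as follows. This is an illustrative guess at their shape, not the authors' formulation: the linear threshold schedule, the `start`, `end`, and `penalty` values, the choice to penalise only steps above the threshold, and the KL direction (student relative to teacher) are all assumptions, and every function name is hypothetical.

```python
import math


def efficiency_threshold(epoch, total_epochs, start=8.0, end=2.0):
    """Allowed tool-call steps, tightened linearly across epochs (assumed schedule)."""
    if total_epochs <= 1:
        return end
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + (end - start) * frac


def trace_reward(correct, tool_steps, epoch, total_epochs, penalty=0.1):
    """Trajectory-level reward: correct answers score 1.0 minus a step penalty.

    Assumption: only steps above the current threshold are penalised.
    """
    if not correct:
        return 0.0
    excess = max(0.0, tool_steps - efficiency_threshold(epoch, total_epochs))
    return max(0.0, 1.0 - penalty * excess)


def token_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one token position (direction is an assumption)."""
    return sum(
        s * math.log((s + eps) / (t + eps))
        for s, t in zip(student_probs, teacher_probs)
    )


def opd_loss(correct, student_seq, teacher_seq):
    """Token-level distillation loss, active only on failed rollouts."""
    if correct or not student_seq:
        return 0.0
    per_token = [token_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Same correct trajectory with 5 tool calls, early versus late in training.
    print(trace_reward(True, 5, epoch=0, total_epochs=5))
    print(trace_reward(True, 5, epoch=4, total_epochs=5))
    # A failed rollout receives a dense corrective signal instead.
    print(opd_loss(False, [[0.9, 0.1]], [[0.5, 0.5]]))
```

Under these assumptions, a trajectory that was acceptable early in training earns less reward later as the threshold tightens, while the distillation term stays at zero for any rollout that already succeeded.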

The authors also introduced IMEB (Image Multi-Entity Benchmark) — 300 human-curated instances spanning humanities and science, each requiring simultaneous identification of multiple image entities. On IMEB, HyperEyes-30B outperforms the second-best model by 64.0%. The CAS (Companion Assessment Standard) metric measures useful information returns per unit compute; HyperEyes-30B achieves 7.6× better information efficiency than sequential models.

Classification
Multimodal · Tool-using model · Agent model · Vision
Access & deployment
Download
Local · Cloud
Weights: Open source
Key parameters
🧩 Parameters: 7B / 30B
✓ Tools · ✓ Fine-tuning
📥 Input: text, image

Technical specification

Parameters: 7B / 30B
License: CC BY 4.0
Features: Tool use, Fine-tuning
Modalities
⬇ Input: text, image
⬆ Output: text

Capabilities and applications

Native model capabilities
Multimodal understanding
Category: multimodal

Benchmark results

2 benchmarks
IMEB (Image Multi-Entity Benchmark)
+64.0% vs 2nd best
📄 arXiv:2605.07177
Authors' own benchmark — 300 human-curated instances requiring simultaneous identification of multiple image entities. Humanities and science domains.
6 multimodal search benchmarks (aggregate)
+9.9% accuracy, 5.3× fewer tool-call rounds
📄 arXiv:2605.07177
HyperEyes-30B vs strongest comparable open-source agent, averaged across 6 benchmarks.

Technical architecture

Training Techniques