HyperEyes

30B / 7B

Parallel multimodal search agent from Xiaohongshu and the University of Cambridge that processes multiple entities concurrently rather than sequentially. Averaged across six benchmarks, HyperEyes-30B improves accuracy by 9.9% over the strongest comparable open-source agent while using 5.3× fewer tool-call rounds.

Research · Research only · Open source · Multimodal · Tool-using model · Agent model · Vision

Parameters: 7B / 30B

Release date: 8 May 2026

Developer: Xiaohongshu (research lab)

Access: Download · Deployment: Local, Cloud

Overview

HyperEyes is a parallel multimodal search agent developed by Xiaohongshu (rednote-hilab) and the University of Cambridge. It targets a core bottleneck of existing multimodal agents, the "N-run bottleneck", in which tool calls are issued one entity at a time. For a query involving multiple independent entities (for example, six people in a photograph), this produces cascading interaction rounds, growing token consumption, and compounding error risk.

The core innovation is UGS (Unified Grounded Search), an atomic action that fuses visual grounding with multi-entity retrieval in a single step. Rather than "searching deeper" (adding sequential rounds), HyperEyes "searches wider": one call covers all target entities concurrently.
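The contrast between per-entity rounds and a single wide call can be sketched in a few lines of Python. This is only an illustration of the round-count argument, not the UGS implementation: the `Region` structure, the `search_entity` stub, and both search helpers are hypothetical names introduced here, and the retrieval call is a placeholder that returns a string instead of querying a real backend.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A grounded image region (hypothetical structure): label plus pixel box."""
    label: str
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2)


def search_entity(region: Region) -> str:
    """Stand-in for a retrieval tool call; a real agent would query a search backend."""
    return f"result for {region.label} at {region.box}"


def sequential_search(regions: list[Region]) -> tuple[list[str], int]:
    """One tool-call round per entity: N entities cost N rounds."""
    results = [search_entity(r) for r in regions]
    return results, len(regions)


def parallel_search(regions: list[Region]) -> tuple[list[str], int]:
    """All grounded regions go out in a single round and are retrieved concurrently."""
    if not regions:
        return [], 0
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        results = list(pool.map(search_entity, regions))  # map preserves input order
    return results, 1


# Six people in a photograph, as in the example above.
regions = [Region(f"person_{i}", (i * 100, 0, i * 100 + 90, 200)) for i in range(6)]
seq_results, seq_rounds = sequential_search(regions)
par_results, par_rounds = parallel_search(regions)
```

Both helpers return the same results in the same order; only the number of interaction rounds differs (six versus one in this toy setup), which is the quantity the "searches wider" framing is about.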

Training proceeds in two stages: (1) a cold start via a Parallel-Amenable Data Synthesis Pipeline with Progressive Rejection Sampling, and (2) Dual-Grained Efficiency-Aware Reinforcement Learning. The second stage combines two signals. TRACE (Tool-use Reference-Adaptive Cost Efficiency) operates at the trajectory level: it rewards correct answers and applies a penalty proportional to the number of tool-call steps, with the efficiency threshold tightened monotonically across epochs. OPD (On-Policy Distillation) operates at the token level: it activates exclusively on failed rollouts, injecting dense corrective signals from a 235B teacher model via KL divergence.
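The two reward granularities can be made concrete with a small sketch. The exact formulas are not given on this page, so everything below is an assumption for illustration: the linear threshold schedule, the `penalty` coefficient, charging the penalty only for steps beyond the threshold, and the KL direction (student relative to teacher) are all hypothetical choices, and the function names are invented here.

```python
import math


def step_threshold(epoch: int, start: int = 8, floor: int = 2) -> int:
    """Allowed tool-call steps before a penalty applies; non-increasing in epoch (assumed schedule)."""
    return max(floor, start - epoch)


def trace_reward(correct: bool, steps: int, epoch: int, penalty: float = 0.1) -> float:
    """Trajectory-level reward: a correct answer earns 1.0 minus a step-proportional penalty."""
    if not correct:
        return 0.0
    excess = max(0, steps - step_threshold(epoch))
    return max(0.0, 1.0 - penalty * excess)


def token_kl(student: list[float], teacher: list[float]) -> float:
    """KL(student || teacher) at one token position; assumes teacher probabilities are > 0."""
    return sum(p * math.log(p / q) for p, q in zip(student, teacher) if p > 0.0)


def opd_loss(correct: bool, student_dists: list[list[float]],
             teacher_dists: list[list[float]]) -> float:
    """Token-level distillation term, active only on failed rollouts."""
    if correct or not student_dists:
        return 0.0
    kls = [token_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(kls) / len(kls)
```

Under these assumed settings, a correct six-step trajectory is unpenalised at epoch 0 (threshold 8) but loses reward at epoch 5 (threshold 3), which is one way to read "tightened monotonically across epochs". The distillation term is zero for successful rollouts, so the dense teacher signal only reaches trajectories that failed.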

The authors also introduce IMEB (Image Multi-Entity Benchmark): 300 human-curated instances spanning humanities and science, each requiring simultaneous identification of multiple entities in an image. On IMEB, HyperEyes-30B outperforms the second-best model by 64.0%. The CAS (Companion Assessment Standard) metric measures useful information returned per unit of compute; by this measure, HyperEyes-30B achieves 7.6× better information efficiency than sequential models.

Classification

Multimodal · Tool-using model · Agent model · Vision

Access & deployment

Access: Download

Deployment: Local, Cloud

Weights: Open source


Technical specification

Parameters: 7B / 30B

License: CC BY 4.0

Features: Tool use · Fine-tuning

Input modalities: text, image

Output modalities: text

Capabilities and applications

Native model capabilities

Multimodal understanding

Category: multimodal

Benchmark results

IMEB (Image Multi-Entity Benchmark)

Result: +64.0% vs. the second-best model

Source: arXiv:2605.07177

Notes: The authors' own benchmark. 300 human-curated instances, each requiring simultaneous identification of multiple image entities, across humanities and science domains.

Six multimodal search benchmarks (aggregate)

Result: +9.9% accuracy, 5.3× fewer tool-call rounds

Source: arXiv:2605.07177

Notes: HyperEyes-30B vs. the strongest comparable open-source agent, averaged across the six benchmarks.

Technical architecture

Core architecture: Multimodal LLM · Agentic AI

Model form: Multimodal LLM

Training techniques: RFT · SFT

Sources and related pages

Paper: HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents (arXiv:2605.07177), arxiv.org

Repo: DeepExperience/HyperEyes, GitHub (github.com)
