
Parallel multimodal search agent from Xiaohongshu and the University of Cambridge that processes multiple entities concurrently rather than sequentially. The authors report 9.9% higher accuracy than the best comparable open-source agent, with 5.3× fewer tool-call rounds.
🔬 Research · 🔬 Research only · ⚖ Open source · Multimodal · Tool-using model · Agent model · Vision
Parameters: 7B / 30B
Release date: 8 May 2026
Access: Download · Deployment: 💻 Local, ☁ Cloud
Overview
Classification
Multimodal · Tool-using model · Agent model · Vision
Access & deployment
Access: Download
Deployment: Local · Cloud
Weights: Open source
Key parameters
🧩 Parameters: 7B / 30B
✓ Tools · ✓ Fine-tuning
📥 Input: text, image
Technical specification
Parameters: 7B / 30B
License: CC BY 4.0
Features: ✓ Tool use · ✓ Fine-tuning
Modalities
⬇ Input: text, image
⬆ Output: text
Capabilities and applications
Native model capabilities
Multimodal understanding
Category: multimodal
Benchmark results
2 benchmarks
IMEB (Image Multi-Entity Benchmark)
+64.0% vs 2nd best
📄 arXiv:2605.07177
Authors' own benchmark: 300 human-curated instances that require identifying multiple entities in an image simultaneously, spanning humanities and science domains.
6 multimodal search benchmarks (aggregate)
+9.9% accuracy, 5.3× fewer tool-call rounds
📄 arXiv:2605.07177
HyperEyes-30B vs. the strongest comparable open-source agent, averaged across 6 benchmarks.
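To illustrate why issuing tool calls for several entities at once can reduce the number of tool-call rounds, here is a minimal Python sketch. It is not the HyperEyes implementation: `search_tool`, `sequential_agent`, `parallel_agent`, and `compare` are hypothetical names invented for this example, and the "tool" is a stub that returns a placeholder string. It only contrasts one call per round with a single concurrent round.

```python
import asyncio


# Hypothetical stand-in for a search tool; not part of any real HyperEyes API.
async def search_tool(entity: str) -> str:
    await asyncio.sleep(0)  # placeholder for real network latency
    return f"result for {entity}"


async def sequential_agent(entities: list[str]) -> tuple[list[str], int]:
    """Issue one tool call per round, so rounds grow with the entity count."""
    results: list[str] = []
    rounds = 0
    for entity in entities:
        results.append(await search_tool(entity))
        rounds += 1
    return results, rounds


async def parallel_agent(entities: list[str]) -> tuple[list[str], int]:
    """Issue all tool calls together, so they share a single round."""
    results = await asyncio.gather(*(search_tool(e) for e in entities))
    rounds = 1 if entities else 0
    return list(results), rounds


def compare(entities: list[str]) -> tuple[int, int]:
    """Return (sequential rounds, parallel rounds) for the same entities."""
    _, seq_rounds = asyncio.run(sequential_agent(entities))
    _, par_rounds = asyncio.run(parallel_agent(entities))
    return seq_rounds, par_rounds
```

In this toy setting, three entities take three rounds sequentially and one round in parallel. The 5.3× figure reported above comes from the authors' evaluation, not from this sketch.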
Technical architecture
Core Architecture
Model Form