Multimodal

UGS

2026 · Research · Published

Key innovation

Combines visual grounding and retrieval into a single atomic action that handles N entities concurrently, replacing N sequential tool calls with one parallel query.

How it works

For a given query (an image plus a text question), the model identifies every entity that requires a search and, in a single atomic action, generates a bounding box (visual grounding) and a retrieval query for each one. The searches run in parallel, their results are aggregated, and the model generates the final answer. Example: question about 6 people in a photo → 1 UGS action → 6 parallel searches → answer in 3 rounds instead of 12.
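The dispatch step can be sketched roughly as follows. This is an illustrative sketch, not the HyperEyes implementation: `GroundedQuery`, `fake_search`, and `run_ugs_action` are hypothetical names, and the search backend is a stub standing in for a real retrieval tool.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class GroundedQuery:
    """One entity in a UGS action: a bounding box plus its retrieval query."""
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
    query: str


def fake_search(item: GroundedQuery) -> str:
    """Hypothetical stand-in for a real retrieval tool call."""
    return f"result for {item.query!r} at {item.box}"


def run_ugs_action(items: list[GroundedQuery], search=fake_search) -> list[str]:
    """Dispatch every grounded query at once; results keep the input order."""
    if not items:
        return []
    with ThreadPoolExecutor(max_workers=len(items)) as pool:
        return list(pool.map(search, items))


# Six entities (e.g. six people in a photo) handled by one action.
action = [GroundedQuery((i * 50, 0, i * 50 + 40, 80), f"person {i}") for i in range(6)]
results = run_ugs_action(action)
```

The aggregated `results` list would then be passed back to the model for the final answer; how that aggregation is formatted is left open here.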

Problem solved

Sequential multimodal agents process one entity per round, so a query with N entities requires N tool-call rounds, and latency, token cost, and the risk of error propagation accumulate with each round. UGS removes this bottleneck by handling all N entities in one action.

Implementation

Reference implementations

HyperEyes

Python

Official

Implementation pitfalls

Overly broad grounding reduces precision · Medium

When the model attempts to ground too many entities simultaneously, bounding boxes may overlap or cover incorrect image regions, degrading retrieval quality.
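One way to notice this problem is to flag pairs of boxes that overlap heavily before issuing the searches. The sketch below uses intersection over union (IoU); the 0.5 threshold and the function names are assumptions for illustration, not something the source prescribes.

```python
from itertools import combinations

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def overlapping_pairs(boxes: list[Box], threshold: float = 0.5) -> list[tuple[int, int]]:
    """Return index pairs whose IoU exceeds the (assumed) threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(boxes), 2)
        if iou(a, b) > threshold
    ]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
flagged = overlapping_pairs(boxes)  # the first two boxes overlap heavily
```

Flagged pairs could be merged, re-grounded, or searched only once, depending on the application.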

Dependency on base model visual grounding quality · Medium

UGS is only as good as the base model's visual grounding — poor grounding on complex images (crowds, small objects) directly results in incorrect retrieval queries.

Parallel tool calls increase cost per round · Medium

A single UGS action triggers N parallel tool calls — for queries with many entities the cost per round is higher than in a sequential agent, even though the total number of rounds is lower.
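The trade-off can be made concrete with a deliberately simplified model that counts only search rounds and search calls. It assumes one search per entity and ignores planning and answer rounds, so its numbers are not the 3-versus-12 round totals from the example above; `search_profile` is a hypothetical helper.

```python
def search_profile(n_entities: int, parallel: bool) -> dict[str, int]:
    """Search rounds and calls per round under a one-search-per-entity assumption."""
    if n_entities < 0:
        raise ValueError("n_entities must be non-negative")
    if n_entities == 0:
        return {"rounds": 0, "calls_per_round": 0, "total_calls": 0}
    rounds = 1 if parallel else n_entities
    calls_per_round = n_entities if parallel else 1
    return {
        "rounds": rounds,
        "calls_per_round": calls_per_round,
        "total_calls": rounds * calls_per_round,
    }


sequential = search_profile(6, parallel=False)  # 6 rounds, 1 call each
ugs = search_profile(6, parallel=True)          # 1 round, 6 calls at once
```

Under these assumptions the total number of search calls is the same in both modes; what changes is that the parallel mode concentrates all of them in a single round.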

Evolution

Original paper · 2026 · Guankai Li

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

Guankai Li, Jiabin Chen, Yi Xu, Xichen Zhang, Yuan Lu

Sources

HyperEyes arXiv paper

Paper
