For a given query (image + text question), the model identifies all entities that require search and, in a single atomic action, generates bounding boxes (visual grounding) and retrieval queries for all of them. The results of the parallel searches are then aggregated, and the model generates the final answer. Example: a question about 6 people in a photo → 1 UGS action → 6 parallel searches → answer in 3 rounds instead of 12.
Sequential multimodal agents process one entity per round, so a query with N entities generates N tool-call rounds, accumulating latency, token costs, and error-propagation risk. UGS eliminates this bottleneck.
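The single-action, parallel-search flow can be sketched roughly as follows. This is a minimal illustration, not the actual UGS implementation: `GroundedEntity`, `search`, and `run_ugs_action` are hypothetical names, the search backend is a stub, and the bounding boxes are made-up pixel coordinates.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class GroundedEntity:
    """One entity emitted by the (hypothetical) UGS action."""
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels, assumed format
    query: str                      # retrieval query generated for this region


async def search(entity: GroundedEntity) -> str:
    # Stand-in for a real retrieval tool call (assumption: it is I/O-bound).
    await asyncio.sleep(0.01)
    return f"result for {entity.query!r}"


async def run_ugs_action(entities: list[GroundedEntity]) -> list[str]:
    # All searches are issued concurrently; gather returns results in input order.
    return await asyncio.gather(*(search(e) for e in entities))


# Six people in a photo: one action carries six boxes and six queries.
entities = [
    GroundedEntity(box=(i * 50, 0, i * 50 + 40, 80), query=f"person {i + 1}")
    for i in range(6)
]
results = asyncio.run(run_ugs_action(entities))
```

The aggregated `results` list would then be passed back to the model for the final answer; how that aggregation is formatted is not specified here.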
When the model attempts to ground too many entities simultaneously, bounding boxes may overlap or cover incorrect image regions, degrading retrieval quality.
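One way to detect the overlap problem described above is to compute pairwise intersection-over-union (IoU) between the predicted boxes and flag suspicious pairs before issuing searches. The sketch below assumes boxes in `(x1, y1, x2, y2)` format; the 0.5 threshold and the function names are illustrative assumptions, not part of UGS.

```python
from itertools import combinations


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the intersection rectangle (0 if boxes are disjoint).
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def overlapping_pairs(boxes, threshold=0.5):
    """Return index pairs whose IoU exceeds the (assumed) threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(boxes), 2)
        if iou(a, b) > threshold
    ]


boxes = [(0, 0, 100, 100), (10, 10, 110, 110), (200, 200, 300, 300)]
flagged = overlapping_pairs(boxes)  # the first two boxes overlap heavily
```

Flagged pairs could be re-grounded or searched separately. Which fallback is appropriate depends on the application.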
UGS is only as good as the base model's visual grounding: poor grounding on complex images (crowds, small objects) directly results in incorrect retrieval queries.
A single UGS action triggers N parallel tool calls, so for queries with many entities the cost per round is higher than in a sequential agent, even though the total number of rounds is lower.
Running visual grounding and retrieval for N entities in parallel requires a GPU for efficient multimodal model inference.