Robots Atlas
Artificial Intelligence

When should AI speak up? Tsinghua releases two proactive-AI benchmarks

Two research teams from Tsinghua University published companion benchmarks in May 2026 — EgoIntrospect and IPIBench — each measuring a different dimension of how well multimodal large language models understand users and act without being asked. The results are consistent: today's models fail on both counts.

Key takeaways

  • EgoIntrospect: first egocentric dataset for user-centric internal state reasoning — 180 hours of recordings from 60 participants with synchronized video, audio, gaze, motion and physiological signals
  • IPIBench: first benchmark for evaluating proactive AI intelligence in continuous video streams — three task categories: monitoring, task management, and interleaved reactive-proactive requests
  • Both benchmarks expose the same two model weaknesses: inability to infer users' internal states and unstable proactive triggering
  • IPI-Agent — a training-free framework proposed by the IPIBench authors — consistently improves all tested models across all benchmark categories
  • EgoIntrospect dataset will be made publicly available; IPIBench is described in arXiv preprint 2605.27074

What today's AI assistants are missing

Current large language models are built as reactive systems — they wait, respond, stop. That works for a web chatbot. It does not work for AR glasses, home robots, or embodied assistants where there is no keyboard and no screen — just a user in motion, with shifting mood, intent and context.

Researchers at Tsinghua University's MEOW Lab asked the question directly: can modern multimodal models understand what a user needs before the user asks? And when is the right moment to speak up unsolicited? They pursued two separate answers in May 2026, in preprints submitted to arXiv nine days apart.

EgoIntrospect: recording the inner person

The first paper — EgoIntrospect (arXiv:2605.17262, submitted May 17, 2026) — addresses the prerequisite question: does the model understand what's happening inside the user at all?

The team collected 180 hours of recordings from 60 participants, each wearing a cross-device capture rig for an average of 3 hours. The defining feature is synchronized multimodality: every recording combines first-person video, ambient audio, gaze tracking, motion data and physiological signals — all aligned in time. Critically, participants self-annotated their internal states: which moments were emotionally significant, when they formed a specific intent to interact with an AI assistant, and when they needed memory support.
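
Aligning streams in time usually means mapping samples recorded at different rates onto one shared clock. The article does not describe how the EgoIntrospect team did this, so the following is only a minimal sketch of one common approach, a nearest-timestamp lookup. The function name, the millisecond timestamps and the heart-rate values are all hypothetical.

```python
from bisect import bisect_left


def nearest_sample(timestamps, values, t):
    """Return the value whose timestamp is closest to t.

    Assumes timestamps is sorted ascending and non-empty.
    This is an illustrative helper, not the paper's pipeline.
    """
    i = bisect_left(timestamps, t)
    if i == 0:
        return values[0]
    if i == len(timestamps):
        return values[-1]
    before, after = timestamps[i - 1], timestamps[i]
    # On a tie, prefer the earlier sample.
    return values[i] if after - t < t - before else values[i - 1]


# Hypothetical data: video frames every 500 ms, heart rate sampled irregularly.
video_ms = [0, 500, 1000]
heart_ms = [0, 400, 900, 1300]
heart_bpm = [70, 72, 75, 74]

# One heart-rate reading per video frame, taken from the closest sample.
aligned = [nearest_sample(heart_ms, heart_bpm, t) for t in video_ms]
print(aligned)  # [70, 72, 75]
```

A real capture rig would also have to correct for clock drift between devices, which this sketch ignores.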

The resulting benchmark tests three capabilities. The first is affective experience: can a model infer the user's emotional state from egocentric data? The second is interactive intent: does the model know when and why the user wants help? The third is cognitive memory: can the model recognize that a user has forgotten something and needs a reminder, even when the user does not ask for one?

Results are uniform: no tested model effectively infers users' subjective internal states. Combining modalities does not rescue them. Video alone carries too little signal, and adding gaze and physiological data does not help, because the tested models cannot meaningfully fuse these inputs.

IPIBench: real-time, not replay

The second paper — IPIBench (arXiv:2605.27074, submitted May 26, 2026) — moves the problem to a more dynamic environment: what happens when a model must monitor a continuous video stream while simultaneously handling proactive and reactive tasks?

Existing benchmarks evaluate models on isolated video clips: the model watches a clip, then answers questions. In IPIBench, the video is a live feed and user instructions can arrive at any point, modifying or cancelling earlier ones. Imagine a user saying "remind me when the water boils" and, seconds later, changing their mind to ask for a medication reminder instead. The model must cancel the first task, register the second and keep tracking every commitment that is still active.
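
The bookkeeping behind the boiling-water example can be made concrete with a small task registry. This is a hedged sketch, not anything taken from IPIBench: the class names, the free-text trigger strings and the "clock shows 8 pm" condition are hypothetical and only illustrate what registering, modifying and cancelling commitments might involve.

```python
from dataclasses import dataclass, field


@dataclass
class ProactiveTask:
    task_id: int
    trigger: str   # condition to watch for in the stream (free text, hypothetical)
    message: str   # what to say once the trigger is observed
    active: bool = True


@dataclass
class TaskRegistry:
    """Hypothetical store of proactive commitments (not from the paper)."""
    tasks: dict = field(default_factory=dict)
    next_id: int = 1

    def register(self, trigger, message):
        task = ProactiveTask(self.next_id, trigger, message)
        self.tasks[task.task_id] = task
        self.next_id += 1
        return task.task_id

    def cancel(self, task_id):
        # Keep the record but mark it inactive so the history stays inspectable.
        self.tasks[task_id].active = False

    def modify(self, task_id, trigger=None, message=None):
        task = self.tasks[task_id]
        if trigger is not None:
            task.trigger = trigger
        if message is not None:
            task.message = message

    def active_tasks(self):
        return [t for t in self.tasks.values() if t.active]


registry = TaskRegistry()
water = registry.register("water is boiling", "The water is boiling.")
# The user changes their mind: drop the first task, add a new one.
registry.cancel(water)
meds = registry.register("clock shows 8 pm", "Time to take your medication.")

active_triggers = [t.trigger for t in registry.active_tasks()]
print(active_triggers)  # ['clock shows 8 pm']
```

The hard part the benchmark probes is not this bookkeeping but deciding, from video alone, when a trigger such as "water is boiling" has actually occurred.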

The benchmark covers three task types. Proactive monitoring tests whether the model initiates a response at exactly the right moment without being prompted. Proactive task management adds complexity: modification, cancellation and tracking of multiple parallel user commitments. The third category interleaves reactive queries — direct user questions — with active proactive obligations. This is where coordination breaks down most severely.

Evaluation of representative multimodal models exposes two recurring failures. The first is unstable proactive triggering: models miss the right moment entirely, fire too early or respond too late. The second is weak coordination between reactive and proactive modes: when a user asks a new question mid-stream, the model loses track of its active proactive commitment.
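
The three triggering failures named here (missed, too early, too late) can be told apart by comparing the moment a model speaks with a ground-truth time window. The sketch below assumes such a window exists for every task. The labels, the window bounds and the sample responses are illustrative assumptions, not IPIBench's actual metric.

```python
from collections import Counter


def classify_trigger(fired_at, window_start, window_end):
    """Label one proactive response against a ground-truth window (seconds).

    fired_at is None when the model never spoke up for this task.
    """
    if fired_at is None:
        return "missed"
    if fired_at < window_start:
        return "early"
    if fired_at > window_end:
        return "late"
    return "on_time"


# Hypothetical responses: (time the model spoke or None, window start, window end)
responses = [
    (12.0, 10.0, 15.0),
    (8.5, 10.0, 15.0),
    (17.0, 10.0, 15.0),
    (None, 10.0, 15.0),
]
labels = [classify_trigger(*r) for r in responses]
summary = Counter(labels)
print(labels)  # ['on_time', 'early', 'late', 'missed']
```

Counting the labels per model gives a simple view of whether it tends to jump the gun, lag behind or stay silent.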

IPI-Agent: a patch without retraining

The IPIBench authors also propose a remedy: IPI-Agent, a training-free framework that wraps any existing multimodal model with two control layers. The interaction-control policy separates incoming signals into two queues: reactive (what the user just said) and proactive (registered tasks waiting for a trigger condition). The temporal-gating mechanism adds a checkpoint before any proactive action — the model reviews its task history and the current video context before deciding to speak up. Experiments show IPI-Agent consistently improves results across all three benchmark categories and all tested models, with no weight modifications required.
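
To make the two-layer idea more tangible, here is a minimal sketch of a wrapper with separate reactive and proactive queues and a gate in front of every proactive output. It is an illustration built only from the description above, not the authors' implementation. The class name, the stub model, the substring-based trigger check and the text "frames" are all assumptions.

```python
from collections import deque


class GatedAssistant:
    """Illustrative wrapper; names and logic are assumptions, not IPI-Agent's code."""

    def __init__(self, model, condition_met):
        self.model = model                  # callable: prompt text -> answer text
        self.condition_met = condition_met  # callable: (trigger, frame) -> bool
        self.reactive = deque()             # user questions awaiting an answer
        self.proactive = []                 # registered (trigger, message) pairs
        self.history = []                   # everything the assistant has said

    def ask(self, question):
        self.reactive.append(question)

    def watch(self, trigger, message):
        self.proactive.append((trigger, message))

    def step(self, frame):
        """Process one description of the current scene; return what to say."""
        outputs = []
        # Reactive queue first: direct questions are answered right away.
        while self.reactive:
            question = self.reactive.popleft()
            answer = self.model(f"{question} | context: {frame}")
            outputs.append(answer)
            self.history.append(answer)
        # Gate: check history and current context before any proactive output.
        still_waiting = []
        for trigger, message in self.proactive:
            if message in self.history:
                continue  # already delivered; drop the duplicate commitment
            if self.condition_met(trigger, frame):
                outputs.append(message)
                self.history.append(message)
            else:
                still_waiting.append((trigger, message))
        self.proactive = still_waiting
        return outputs


def stub_model(prompt):
    # Stand-in for a real multimodal model call.
    return f"answer to: {prompt}"


def stub_condition(trigger, frame):
    # Stand-in for visual trigger detection: plain substring match on text.
    return trigger in frame


agent = GatedAssistant(stub_model, stub_condition)
agent.watch("boiling", "The water is boiling.")
agent.ask("What is on the stove?")
first = agent.step("pot on stove, water still")
second = agent.step("pot on stove, water boiling")
print(second)  # ['The water is boiling.']
```

In this toy version, answering the question in the first step does not disturb the pending reminder, which is the coordination property the benchmark tests. In a real system both stubs would be calls to the underlying multimodal model.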

Why this matters

The two papers probe the same deficiency from opposite angles. EgoIntrospect asks: does AI understand the person? IPIBench asks: can AI act at the right moment? Until models can pass both tests, they are poorly suited to proactive assistance in wearables and robots, however well they perform as standard chatbots.

This is not a compute problem — current models already process video in real time. The gap is conceptual: existing multimodal models treat the user as an object in a frame, not as a subject with emotions, intentions and a working memory. EgoIntrospect quantifies the understanding deficit. IPIBench quantifies the coordination deficit. Together they give the research community two specific measurement targets that did not exist before.

For the wearable and home robotics industry the implication is direct: the hardware is increasingly ready, the models are not. The benchmark gap is now clearly defined, which is the necessary condition for closing it.

What's next

  • According to the preprint, the EgoIntrospect dataset will be made publicly available; the team has published a project page, but no release date has been announced
  • IPI-Agent's training-free design means that, in principle, teams at labs such as Google DeepMind, Meta AI or Anthropic could integrate it with existing multimodal models without retraining costs
  • Both benchmarks set a new evaluation floor for wearable and home robotics AI assistants; next-generation models are likely to be measured against these tests
