DeepEyes-V2

Agentic multimodal model by Xiaohongshu (RedNote-hilab) integrating image understanding, web search and code execution within a unified reasoning chain.

🔬 Research · 🔬 Research only · ⚖ Open source · Multimodal · Tool-using model · Agent model · Vision

Parameters

7B / 32B

Release date

7 November 2025

🔬 Xiaohongshu · Research lab

Access: Download · Deployment: 💻 Local, ☁ Cloud

Overview

DeepEyes-V2 is an agentic multimodal model developed by rednote-hilab (the AI lab of Chinese social platform Xiaohongshu / RedNote). It extends the original DeepEyes v1 "Thinking with Images" concept with full external tool invocation: a Python code execution environment and web search. Both are integrated into a single end-to-end reasoning chain.

Training uses a two-stage pipeline: a cold-start phase establishes tool-use patterns via supervised fine-tuning on curated examples, followed by a reinforcement learning phase that refines tool-invocation decisions. The authors observe that direct RL alone fails to induce robust tool-use behavior. The model is built on Qwen2.5-VL-7B-Instruct or Qwen2.5-VL-32B-Instruct, depending on the variant.

Alongside the model, the team introduced RealX-Bench, an evaluation benchmark that requires combined perception, search and reasoning on real-world tasks. DeepEyes-V2 shows task-adaptive tool invocation: it favors image operations (zoom-in) for perception tasks and numerical computation in code for reasoning tasks. After RL training, the model composes tools into complex multi-step sequences.
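The single reasoning chain described above, in which the model alternates between tool calls and observations until it answers, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `Step`, `run_code`, `web_search`, `reasoning_loop` and `scripted_policy` are hypothetical names, the tools are stubs, and the scripted policy merely stands in for the model. It is not the DeepEyes-V2 implementation or its API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    kind: str     # "code", "search", or "answer"
    payload: str  # source code, search query, or final answer text


def run_code(source: str) -> str:
    """Hypothetical sandbox stand-in: evaluates one arithmetic expression."""
    return str(eval(source, {"__builtins__": {}}, {}))


def web_search(query: str) -> str:
    """Hypothetical search stand-in: returns a canned snippet."""
    return f"[stub result for: {query}]"


TOOLS: Dict[str, Callable[[str], str]] = {"code": run_code, "search": web_search}


def reasoning_loop(
    policy: Callable[[List[str]], Step], question: str, max_turns: int = 6
) -> Tuple[str, List[str]]:
    """Alternate policy decisions and tool observations until an answer appears."""
    context = [f"question: {question}"]
    for _ in range(max_turns):
        step = policy(context)
        if step.kind == "answer":
            return step.payload, context
        observation = TOOLS[step.kind](step.payload)
        # Tool call and tool output both become part of the chain.
        context.append(f"{step.kind}: {step.payload}")
        context.append(f"observation: {observation}")
    return "no answer within turn budget", context


def scripted_policy(context: List[str]) -> Step:
    """Stand-in for the model: search once, compute once, then answer."""
    seen = sum(1 for line in context if line.startswith("observation:"))
    if seen == 0:
        return Step("search", "example query")
    if seen == 1:
        return Step("code", "2 * 21")
    return Step("answer", context[-1][len("observation: "):])


answer, trace = reasoning_loop(scripted_policy, "example question")
print(answer)
```

In a real system the policy would be the model itself, deciding per turn whether a tool is needed at all, and the code tool would run in an isolated sandbox rather than through `eval`.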

Classification

Multimodal · Tool-using model · Agent model · Vision

Access & deployment

Download

Local · Cloud

Weights: Open source

Key parameters

🧩 Parameters: 7B / 32B

✓ Tools · ✓ Fine-tuning

📥 Input: text, image

Technical specification

Parameters

7B / 32B

License

Apache-2.0

Hardware requirements

Training: 32+ GPUs (4 nodes × 8) for the 7B variant; 64+ GPUs (8 nodes × 8) for the 32B variant. At least 1200 GB of RAM per node, due to the high-resolution images in the V* and ArxivQA datasets.
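The cluster minimums implied by these figures can be worked out with a short sketch. The node counts, GPUs per node and per-node RAM come from the requirements above; the helper name `cluster_minimums` and the aggregate RAM totals are illustrative derivations, not figures published by the authors.

```python
GPUS_PER_NODE = 8             # from "nodes × 8" above
RAM_PER_NODE_GB = 1200        # stated per-node minimum
NODES = {"7B": 4, "32B": 8}   # stated minimum node counts per variant


def cluster_minimums(variant: str) -> dict:
    """Return minimum nodes, GPUs and aggregate RAM for a training run."""
    nodes = NODES[variant]
    return {
        "nodes": nodes,
        "gpus": nodes * GPUS_PER_NODE,
        "ram_gb": nodes * RAM_PER_NODE_GB,  # derived total, not a published figure
    }


for variant in NODES:
    print(variant, cluster_minimums(variant))
```

Under these assumptions the 7B run needs at least 32 GPUs and 4800 GB of RAM across the cluster, and the 32B run at least 64 GPUs and 9600 GB.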

Features: ✓ Tool use · ✓ Fine-tuning

Modalities

⬇ Input

text, image

⬆ Output

text, code