
DeepEyes-V2

An agentic multimodal model from Xiaohongshu (rednote-hilab) that integrates image understanding, web search, and code execution in a single reasoning chain.
Research · Research only · Open source · Multimodal · Tool-using model · Agent model · Vision
Parameters: 7B / 32B
Release date: 7 November 2025
Access: Download
Deployment: Local, Cloud

Overview

DeepEyes-V2 is an agentic multimodal model developed by rednote-hilab (the AI lab of Chinese social platform Xiaohongshu / RedNote). It extends the original DeepEyes v1 "Thinking with Images" concept with full external tool invocation: a Python code execution environment and web search. Both are integrated into a single end-to-end reasoning chain.
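The idea of interleaving tool calls and observations in one chain can be pictured as a small dispatch loop. The following is a minimal sketch, not the model's actual interface: the step labels (`code`, `search`, `answer`), the `run_chain` function, the stub tools, and the scripted policy that stands in for the model are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    kind: str     # "code", "search", or "answer" (hypothetical labels)
    content: str  # source code, search query, or final answer text


Policy = Callable[[List[str]], Step]
Tool = Callable[[str], str]


def run_chain(policy: Policy, tools: Dict[str, Tool], question: str,
              max_steps: int = 8) -> Tuple[Optional[str], List[str]]:
    """Alternate policy steps and tool observations until an answer appears."""
    context = [f"question: {question}"]
    for _ in range(max_steps):
        step = policy(context)
        if step.kind == "answer":
            return step.content, context
        observation = tools[step.kind](step.content)
        # The tool output is appended so the next step can condition on it.
        context.append(f"{step.kind}: {step.content}")
        context.append(f"observation: {observation}")
    return None, context  # step budget exhausted without an answer


# Stub tools standing in for a real sandbox and a real search backend.
def toy_code_tool(source: str) -> str:
    return str(sum(int(part) for part in source.split("+")))


def toy_search_tool(query: str) -> str:
    facts = {"height of tower A": "120", "height of tower B": "95"}
    return facts.get(query, "no result")


def scripted_policy(context: List[str]) -> Step:
    """Stand-in for the model: chooses the next step from the observations so far."""
    observations = [line.split(": ", 1)[1] for line in context
                    if line.startswith("observation: ")]
    if len(observations) == 0:
        return Step("search", "height of tower A")
    if len(observations) == 1:
        return Step("search", "height of tower B")
    if len(observations) == 2:
        return Step("code", f"{observations[0]}+{observations[1]}")
    return Step("answer", observations[2])


answer, trace = run_chain(
    scripted_policy,
    {"code": toy_code_tool, "search": toy_search_tool},
    "What is the combined height of towers A and B?",
)
print(answer)
```

In this toy run the chain performs two searches, feeds both results into the code tool, and only then answers. A real deployment would replace the scripted policy with the model and the stubs with a sandboxed interpreter and a search service.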

Training uses a two-stage pipeline: a cold-start phase establishes tool-use patterns via supervised fine-tuning on curated examples, followed by a reinforcement learning phase that refines tool-invocation decisions. The authors observe that direct RL alone fails to induce robust tool-use behavior. The model is built on Qwen-2.5-VL-7B-Instruct or Qwen-2.5-VL-32B-Instruct as its foundation.
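The two-stage idea can be illustrated on a toy problem. The sketch below is not the authors' training code and does not attempt to reproduce their finding about RL alone: it assumes a tiny tabular softmax policy, two made-up task types, and a 0/1 reward, and it uses a plain REINFORCE-style update in place of whatever RL algorithm the paper uses. All names (`cold_start`, `reinforce`, `BEST`) are hypothetical.

```python
import math
import random

TASKS = ["needs_tool", "no_tool_needed"]
ACTIONS = ["call_tool", "answer_directly"]
# Hypothetical ground truth used for the demonstrations and the reward.
BEST = {"needs_tool": "call_tool", "no_tool_needed": "answer_directly"}


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nudge(logits, action_idx, scale):
    """One gradient step on log p(action): logits += scale * (onehot - probs)."""
    probs = softmax(logits)
    for i in range(len(logits)):
        logits[i] += scale * ((1.0 if i == action_idx else 0.0) - probs[i])


def cold_start(policy, demos, lr=0.5):
    """Stage 1: supervised steps that imitate curated (task, action) pairs."""
    for task, action in demos:
        nudge(policy[task], ACTIONS.index(action), lr)


def reinforce(policy, rng, steps=200, lr=0.5):
    """Stage 2: sample an action, then reinforce it in proportion to its reward."""
    for _ in range(steps):
        task = rng.choice(TASKS)
        probs = softmax(policy[task])
        idx = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        reward = 1.0 if ACTIONS[idx] == BEST[task] else 0.0
        nudge(policy[task], idx, lr * reward)


def prob_of_best(policy):
    return {t: softmax(policy[t])[ACTIONS.index(BEST[t])] for t in TASKS}


policy = {task: [0.0, 0.0] for task in TASKS}  # starts uniform
demos = [("needs_tool", "call_tool"), ("no_tool_needed", "answer_directly")] * 3
cold_start(policy, demos)
after_sft = prob_of_best(policy)
reinforce(policy, random.Random(0))
after_rl = prob_of_best(policy)
print(after_sft, after_rl)
```

The supervised stage moves the policy off the uniform starting point, and the RL stage then sharpens the same decisions using only reward.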

Alongside the model, the team introduced RealX-Bench, an evaluation benchmark that requires combined perception, search and reasoning on real-world tasks. DeepEyes-V2 shows task-adaptive tool invocation: it uses image operations (zoom-in) for perception tasks and numerical computation in code for reasoning tasks. After RL training, the model composes tools into complex multi-step sequences.
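A zoom-in operation of the kind mentioned above amounts to cropping a region and enlarging it. The sketch below is an illustration only: it assumes a grayscale image stored as nested Python lists and nearest-neighbour upscaling, whereas code written by the model would more plausibly use an image library. The `zoom_in` function and its parameters are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows; a stand-in for a real image array


def zoom_in(image: Image, box: Tuple[int, int, int, int], factor: int = 2) -> Image:
    """Crop box=(left, top, right, bottom) and upscale it by nearest neighbour."""
    left, top, right, bottom = box
    height, width = len(image), len(image[0])
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError("box must lie inside the image")
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    crop = [row[left:right] for row in image[top:bottom]]
    zoomed: Image = []
    for row in crop:
        # Repeat each pixel horizontally, then repeat the widened row vertically.
        wide = [pixel for pixel in row for _ in range(factor)]
        zoomed.extend(list(wide) for _ in range(factor))
    return zoomed


# A 4x4 test image whose pixel value encodes its position (10 * row + column).
image = [[10 * r + c for c in range(4)] for r in range(4)]
patch = zoom_in(image, (1, 1, 3, 3), factor=2)
for row in patch:
    print(row)
```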

Classification
Multimodal · Tool-using model · Agent model · Vision
Access & deployment
Access: Download
Deployment: Local, Cloud
Weights: Open source
Key parameters
Parameters: 7B / 32B
✓ Tools · ✓ Fine-tuning
Input: text, image

Technical specification

Parameters
7B / 32B
License
Apache-2.0
Hardware requirements
Training: 32+ GPUs (4 nodes × 8) for the 7B variant; 64+ GPUs (8 nodes × 8) for the 32B variant. Minimum 1200 GB of RAM per node, because the V* and ArxivQA datasets contain high-resolution images.
Features: ✓ Tool use · ✓ Fine-tuning
Modalities
Input: text, image
Output: text, code
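The training hardware figures listed under Hardware requirements follow from simple multiplication. As a quick check, the sketch below recomputes the GPU totals and the aggregate RAM implied by the per-node minimum. The `ClusterSpec` class is a hypothetical helper, and the RAM totals are derived here rather than stated in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    nodes: int
    gpus_per_node: int
    min_ram_gb_per_node: int

    @property
    def total_gpus(self) -> int:
        return self.nodes * self.gpus_per_node

    @property
    def total_ram_gb(self) -> int:
        return self.nodes * self.min_ram_gb_per_node


# Minimum training configurations as listed in the hardware requirements.
CLUSTERS = {
    "7B": ClusterSpec(nodes=4, gpus_per_node=8, min_ram_gb_per_node=1200),
    "32B": ClusterSpec(nodes=8, gpus_per_node=8, min_ram_gb_per_node=1200),
}

for name, spec in CLUSTERS.items():
    print(f"{name}: {spec.total_gpus} GPUs, at least {spec.total_ram_gb} GB RAM across nodes")
```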

Capabilities and applications

Native model capabilities
Multimodal understanding
Category: multimodal

Benchmark results

1 benchmark
RealX-Bench
Score: no data
Source: DeepEyesV2 paper (arXiv:2511.05271)
Team's own benchmark introduced alongside the model; evaluates integrated perception, search and reasoning on real-world tasks.

Technical architecture

Core Architecture
Training Techniques