Robots Atlas>ROBOTS ATLAS
QW

Qwen2.5-VL-7B-Instruct

2.5-VL-7B-Instruct · Family: Qwen
Alibaba Qwen Team multimodal VLM, 7B params, Apache 2.0. Processes images, long video (>1 hr) and documents. SOTA on DocVQA (95.7), ChartQA (87.3), OCRBench (864). GUI agent capabilities.
✓ Active✓ Public access⚖ Open sourceMultimodalTool-using model📁 Qwen
Context window
32K
tokens
Parameters
7B
parameters
Max output
32,768
tokens
Release date
1 January 2025
Access:DownloadAPIHostedDeployment:💻 Local☁ Cloud

Overview

Qwen2.5-VL-7B-Instruct is the instruction-tuned release of the 7-billion-parameter model from the Qwen2.5-VL family developed by the Qwen Team at Alibaba Group. It is a VLM (Vision-Language Model) designed to process multimodal inputs: text, static images and video. Model weights are publicly available under the Apache 2.0 licence.

Architecture

The model is based on the Transformer architecture with an attached ViT (Vision Transformer) visual encoder. The visual encoder has been optimised using window attention, SwiGLU and RMSNorm — accelerating training and inference while maintaining performance. For video processing, Dynamic FPS Sampling and mRoPE with absolute time alignment are used — enabling understanding of videos of varying lengths and frame rates. The model natively supports a 32,768-token context window (extendable via YaRN). Dynamic Input Resolution: images are processed without altering aspect ratio, and the number of visual tokens adjusts to content (default range 4–16,384 tokens per image).

Key capabilities

The model excels in five areas: (1) document understanding — analysis of charts, tables, invoices, forms and scans with structured JSON output; (2) multi-level OCR — text recognition in scene photos, documents and handwriting; (3) long video understanding — videos over 1 hour with precise event timestamp pinpointing; (4) visual localisation — object detection via bounding box or point with JSON output; (5) agentic capabilities — computer and phone use as a visual agent (ScreenSpot 84.7, AndroidControl Low_EM 93.7).

Availability

The model is available for download from Hugging Face and ModelScope under the Apache 2.0 licence. It supports the Transformers, vLLM and SGLang libraries. Flash Attention 2 is recommended for efficient inference. The model is in BF16 format and requires a GPU with at least ~16 GB VRAM. Dynamic min_pixels/max_pixels configurations are available to balance inference quality and speed.

Classification
MultimodalTool-using model
Family: Qwen
Access & deployment
DownloadAPIHosted
LocalCloud
Weights: Open source
Key parameters
📏 Context: 32K
🧩 Parameters: 7B
Tools · ✓ Fine-tuning
📥 Input: text, image, video, documents

Technical specification

Context window
32K
tokens
Parameters
7B
parameters
Max output tokens
32,768
tokens per response
License
Apache 2.0
Hardware requirements
GPU with at least ~16 GB VRAM (BF16). Flash Attention 2 recommended. Supported: Transformers, vLLM, SGLang.
Features:Tool useFine-tuning
Modalities
⬇ Input
textimagevideodocumentsstructured_data
⬆ Output
textcodestructured_data

Capabilities and applications

Native model capabilities
Multimodal understanding
Category: multimodal
Image understanding
Analysing and interpreting the content of images.
Category: vision
Video understanding
The model's ability to analyse and interpret video content — recognising actions, motion, events and relationships between objects over time.
Category: video
OCR
Recognising text within images and documents.
Category: vision
Reasoning
The model's ability to reason logically and solve complex problems.
Category: reasoning
Multi-step reasoning
Carrying out multi-step chains of reasoning across long, complex tasks.
Category: reasoning
Diagram reasoning
Category: reasoning
Coding
Generating, analysing and modifying source code.
Category: coding
Agentic capability
The model's ability to autonomously plan and execute multi-step tasks by sequentially using tools, maintaining context, and adapting to intermediate results.
Category: planning
Computer use
The model's ability to operate a computer interface by interpreting screenshots and generating actions such as clicks, typing, and navigating applications.
Category: planning
Planning
Forming and executing action plans for complex tasks.
Category: planning
Long context
Maintaining coherence and focus across very long input context.
Category: language
Structured output
Producing data in structured formats such as JSON.
Category: structured_generation
Function Calling
Category: planning
Vision encoder
The model's ability to encode images and video frames into dense representations (embeddings), used for downstream tasks or as a backbone for vision-language models.
Category: vision
Interleaved Multimodal Input
Category: reasoning
Vision-language-action grounding
The ability of a VLA model to ground visual perception and a language instruction into a concrete physical robot action. The model understands the scene and intent, then generates an executable action sequence, closing the loop from observation to motion.
Category: robotics

Benchmark results

15 benchmarks
MMMU
accuracy · val split, zero-shot
58.6%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card (val split)
MMMU-Pro
accuracy · val split
41.0%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
DocVQA
ANLS · test split
95.7%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
ChartQA
relaxed accuracy · test split
87.3%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
TextVQA
accuracy · val split
84.9%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
OCRBench
score · test split
864/ 1000
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
MathVista
accuracy · testmini split
68.2%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
MMStar
accuracy · test split
63.9%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
MMBench
accuracy · test split, English
82.6%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
Video-MME
accuracy · without subtitles / with subtitles
65.1 / 71.6% (wo/ w/ subs)
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
ScreenSpot
accuracy · GUI element localization, overall
84.7%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card (agent benchmark)
OSWorld
success rate · AndroidWorld_SR
25.5%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card (agent benchmark)
Score refers to the AndroidWorld_SR sub-task; the model achieves 93.7% on Android Control Low_EM.
MVBench
accuracy · video reasoning, multiple choice
69.6%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
MMVet
score · GPT-4-Turbo as judge
67.1
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
InfoVQA
ANLS · test split
82.6%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card

Technical architecture

Core Architecture
Training Techniques