Alibaba Qwen Team multimodal VLM, 7B params, Apache 2.0. Processes images, long video (>1 hr) and documents. SOTA on DocVQA (95.7), ChartQA (87.3), OCRBench (864). GUI agent capabilities.
Context window
32K
tokens
Parameters
7B
parameters
Max output
32,768
tokens
Release date
1 January 2025
Access:DownloadAPIHostedDeployment:💻 Local☁ Cloud
Overview
Access & deployment
DownloadAPIHosted
LocalCloud
Weights: Open source
Key parameters
📏 Context: 32K
🧩 Parameters: 7B
✓ Tools · ✓ Fine-tuning
📥 Input: text, image, video, documents…
Technical specification
Context window
32K
tokens
Parameters
7B
parameters
Max output tokens
32,768
tokens per response
License
Apache 2.0
Hardware requirements
GPU with at least ~16 GB VRAM (BF16). Flash Attention 2 recommended. Supported: Transformers, vLLM, SGLang.
Features:✓ Tool use✓ Fine-tuning
Modalities
⬇ Input
textimagevideodocumentsstructured_data
⬆ Output
textcodestructured_data
Capabilities and applications
Native model capabilities
Multimodal understanding
Category: multimodal
Image understanding
Analysing and interpreting the content of images.
Category: vision
Video understanding
The model's ability to analyse and interpret video content — recognising actions, motion, events and relationships between objects over time.
Category: video
OCR
Recognising text within images and documents.
Category: vision
Reasoning
The model's ability to reason logically and solve complex problems.
Category: reasoning
Multi-step reasoning
Carrying out multi-step chains of reasoning across long, complex tasks.
Category: reasoning
Diagram reasoning
Category: reasoning
Coding
Generating, analysing and modifying source code.
Category: coding
Agentic capability
The model's ability to autonomously plan and execute multi-step tasks by sequentially using tools, maintaining context, and adapting to intermediate results.
Category: planning
Computer use
The model's ability to operate a computer interface by interpreting screenshots and generating actions such as clicks, typing, and navigating applications.
Category: planning
Planning
Forming and executing action plans for complex tasks.
Category: planning
Long context
Maintaining coherence and focus across very long input context.
Category: language
Structured output
Producing data in structured formats such as JSON.
Category: structured_generation
Function Calling
Category: planning
Vision encoder
The model's ability to encode images and video frames into dense representations (embeddings), used for downstream tasks or as a backbone for vision-language models.
Category: vision
Interleaved Multimodal Input
Category: reasoning
Vision-language-action grounding
The ability of a VLA model to ground visual perception and a language instruction into a concrete physical robot action. The model understands the scene and intent, then generates an executable action sequence, closing the loop from observation to motion.
Category: robotics
Application domains
Benchmark results
15 benchmarks
MMMU
accuracy · val split, zero-shot
58.6%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card (val split)
MMMU-Pro
accuracy · val split
41.0%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
DocVQA
ANLS · test split
95.7%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
ChartQA
relaxed accuracy · test split
87.3%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
TextVQA
accuracy · val split
84.9%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
OCRBench
score · test split
864/ 1000
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
MathVista
accuracy · testmini split
68.2%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
MMStar
accuracy · test split
63.9%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
MMBench
accuracy · test split, English
82.6%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
Video-MME
accuracy · without subtitles / with subtitles
65.1 / 71.6% (wo/ w/ subs)
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
ScreenSpot
accuracy · GUI element localization, overall
84.7%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card (agent benchmark)
OSWorld
success rate · AndroidWorld_SR
25.5%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card (agent benchmark)
Score refers to the AndroidWorld_SR sub-task; the model achieves 93.7% on Android Control Low_EM.
MVBench
accuracy · video reasoning, multiple choice
69.6%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
MMVet
score · GPT-4-Turbo as judge
67.1
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
InfoVQA
ANLS · test split
82.6%
📅 1 Jan 2025📄 Qwen2.5-VL official HuggingFace model card
Technical architecture
Core Architecture
Model Form
Training Techniques
Sources and related pages
4 sources
DocsQwen2.5-VL-7B-Instruct — HuggingFace model cardBlogQwen2.5-VL blog post (Qwen Team, January 2025)PaperQwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv 2409.12191)PaperQwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities (arXiv 2308.12966)