Qwen2.5-VL-7B-Instruct

Qwen2.5-VL-7B-Instruct · Family: Qwen

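Vision-language model (VLM) from the Alibaba Qwen Team with 7B parameters, released under Apache 2.0. It processes images, long video (over 1 hour) and documents, and reports strong document-understanding scores: DocVQA 95.7, ChartQA 87.3, OCRBench 864. It also has GUI agent capabilities.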

✓ Active · ✓ Public access · ⚖ Open source · Multimodal · Tool-using model · 📁 Qwen

Context window: 32K tokens

Parameters: 7B

Max output: 32,768 tokens

Release date: 1 January 2025

🏢 Producer: Alibaba

Access: Download, API, Hosted · Deployment: 💻 Local, ☁ Cloud

Overview

Qwen2.5-VL-7B-Instruct is the instruction-tuned release of the 7-billion-parameter model from the Qwen2.5-VL family developed by the Qwen Team at Alibaba Group. It is a VLM (Vision-Language Model) designed to process multimodal inputs: text, static images and video. Model weights are publicly available under the Apache 2.0 licence.

Architecture

The model is a Transformer-based language model with an attached ViT (Vision Transformer) visual encoder. The visual encoder uses window attention, SwiGLU and RMSNorm, which speeds up training and inference while maintaining performance. For video, Dynamic FPS Sampling and mRoPE with absolute time alignment let the model handle videos of varying lengths and frame rates. The model natively supports a 32,768-token context window, extendable via YaRN. With Dynamic Input Resolution, images are processed without altering their aspect ratio, and the number of visual tokens scales with the image resolution (default range 4 to 16,384 tokens per image).
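The link between image size and visual-token count can be sketched in a few lines. The sketch assumes that one visual token covers a 28×28-pixel area (the unit in which the official model card expresses min_pixels and max_pixels) and that oversized or undersized images are rescaled to fit the token range. The name estimate_visual_tokens and the rounding details are illustrative assumptions, not the library's exact implementation.

```python
import math

PATCH = 28  # assumed pixels per side covered by one visual token


def estimate_visual_tokens(height, width, min_tokens=4, max_tokens=16384):
    """Estimate how many visual tokens an image of the given size produces."""
    min_pixels = min_tokens * PATCH * PATCH
    max_pixels = max_tokens * PATCH * PATCH

    # Snap each side to the nearest multiple of the patch size.
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)

    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH

    return (h // PATCH) * (w // PATCH)
```

Under these assumptions a 1092×1092 image maps to 39 × 39 = 1,521 tokens, while a very large scan is scaled down so that it stays within the 16,384-token ceiling.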

Key capabilities

The model excels in five areas: (1) document understanding — analysis of charts, tables, invoices, forms and scans with structured JSON output; (2) multi-level OCR — text recognition in scene photos, documents and handwriting; (3) long video understanding — videos over 1 hour with precise event timestamp pinpointing; (4) visual localisation — object detection via bounding box or point with JSON output; (5) agentic capabilities — computer and phone use as a visual agent (ScreenSpot 84.7, AndroidControl Low_EM 93.7).
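Downstream code usually has to parse the JSON that the model returns for visual localisation. The sketch below assumes the response is a JSON list of objects with a "bbox_2d" field ([x1, y1, x2, y2] in pixels) and a "label" field, optionally wrapped in a Markdown code fence. The field names and the helper parse_detections are assumptions for illustration; check them against real model output before relying on them.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to avoid nesting fences


def parse_detections(response_text):
    """Extract a list of {'label', 'bbox'} dicts from a model response."""
    text = response_text.strip()
    # Drop an optional Markdown fence wrapped around the JSON payload.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, re.DOTALL)
    if match:
        text = match.group(1)

    detections = []
    for item in json.loads(text):
        x1, y1, x2, y2 = item["bbox_2d"]  # assumed field name
        detections.append({"label": item.get("label", ""), "bbox": (x1, y1, x2, y2)})
    return detections


# Hypothetical response text, used only to exercise the parser.
sample = FENCE + 'json\n[{"bbox_2d": [10, 20, 110, 220], "label": "invoice total"}]\n' + FENCE
print(parse_detections(sample))
```

Keeping the parser strict (a missing "bbox_2d" raises an error) makes malformed responses visible early instead of silently producing empty detections.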

Availability

The model is available for download from Hugging Face and ModelScope under the Apache 2.0 licence. It can be run with Transformers, vLLM and SGLang. Flash Attention 2 is recommended for efficient inference. Weights are distributed in BF16 and need a GPU with roughly 16 GB of VRAM or more. The min_pixels and max_pixels settings can be adjusted to trade inference quality against speed.
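A minimal loading sketch with Transformers might look like the following. It assumes the Hugging Face repository ID Qwen/Qwen2.5-VL-7B-Instruct, a recent Transformers release that includes Qwen2_5_VLForConditionalGeneration, and a 28×28-pixel area per visual token when converting a token budget into min_pixels and max_pixels. The helper names pixel_budget and load_model_and_processor are illustrative.

```python
MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # assumed Hugging Face repository ID
PIXELS_PER_TOKEN = 28 * 28  # assumed pixel area per visual token


def pixel_budget(min_tokens, max_tokens):
    """Convert a visual-token budget into (min_pixels, max_pixels)."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


def load_model_and_processor(min_tokens=256, max_tokens=1280):
    """Load weights and processor; needs torch, transformers and a suitable GPU."""
    # Imported lazily so pixel_budget() works without the heavy dependencies.
    import torch
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    min_pixels, max_pixels = pixel_budget(min_tokens, max_tokens)
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # requires the flash-attn package
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(
        MODEL_ID, min_pixels=min_pixels, max_pixels=max_pixels
    )
    return model, processor
```

A smaller max_tokens value lowers memory use and latency at the cost of fine detail, which matters most for dense documents and small on-screen text.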

Classification

Multimodal · Tool-using model

Family: Qwen

Applications

Coding · Document analysis · Multimodal document understanding · Video analytics

Access & deployment

Access: Download, API, Hosted

Deployment: Local, Cloud

Weights: Open source

Key parameters

📏 Context: 32K

🧩 Parameters: 7B

✓ Tools · ✓ Fine-tuning

📥 Input: text, image, video, documents…

Technical specification

Context window: 32K tokens

Parameters: 7B

Max output tokens: 32,768 tokens per response

License

Apache 2.0

Hardware requirements

GPU with roughly 16 GB of VRAM or more (BF16 weights). Flash Attention 2 recommended. Supported runtimes: Transformers, vLLM, SGLang.
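The 16 GB figure is consistent with a back-of-the-envelope estimate of weight memory. The sketch below assumes 7 billion parameters at 2 bytes each (BF16) and counts only the weights; activations, the KV cache and any extra vision-encoder parameters come on top of that, which is why some headroom above the result is needed. The function name weight_memory_gb is illustrative.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory taken by the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# 7e9 parameters * 2 bytes (BF16) = 14e9 bytes of weights.
print(f"{weight_memory_gb(7e9):.1f} GB")
```

The same function gives a rough idea of what lower-precision weights would need, for example bytes_per_param=1 for an 8-bit format.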

Features: ✓ Tool use · ✓ Fine-tuning

Modalities

⬇ Input

text, image, video, documents, structured_data

⬆ Output

text, code, structured_data

Capabilities and applications

Native model capabilities

Multimodal understanding

Category: multimodal

Image understanding

Analysing and interpreting the content of images.

Category: vision

Video understanding

The model's ability to analyse and interpret video content — recognising actions, motion, events and relationships between objects over time.

Category: video

OCR

Recognising text within images and documents.

Category: vision

Reasoning

The model's ability to reason logically and solve complex problems.

Category: reasoning

Multi-step reasoning

Carrying out multi-step chains of reasoning across long, complex tasks.

Category: reasoning

Diagram reasoning

Category: reasoning

Coding

Generating, analysing and modifying source code.

Category: coding

Agentic capability

The model's ability to autonomously plan and execute multi-step tasks by sequentially using tools, maintaining context, and adapting to intermediate results.

Category: planning

Computer use

The model's ability to operate a computer interface by interpreting screenshots and generating actions such as clicks, typing, and navigating applications.

Category: planning

Planning

Forming and executing action plans for complex tasks.

Category: planning

Long context

Maintaining coherence and focus across very long input context.

Category: language

Structured output

Producing data in structured formats such as JSON.

Category: structured_generation

Function Calling

Category: planning

Vision encoder

The model's ability to encode images and video frames into dense representations (embeddings), used for downstream tasks or as a backbone for vision-language models.

Category: vision

Interleaved Multimodal Input

Category: reasoning

Vision-language-action grounding

The ability of a VLA model to ground visual perception and a language instruction into a concrete physical robot action. The model understands the scene and intent, then generates an executable action sequence, closing the loop from observation to motion.

Category: robotics

Application domains

Coding · Document analysis · Multimodal document understanding · Video analytics

Benchmark results

15 benchmarks

All results are dated 1 Jan 2025 and taken from the Qwen2.5-VL official HuggingFace model card.

MMMU: 58.6% (accuracy; val split, zero-shot)

MMMU-Pro: 41.0% (accuracy; val split)

DocVQA: 95.7% (ANLS; test split)

ChartQA: 87.3% (relaxed accuracy; test split)

TextVQA: 84.9% (accuracy; val split)

OCRBench: 864 / 1000 (score; test split)

MathVista: 68.2% (accuracy; testmini split)

MMStar: 63.9% (accuracy; test split)

MMBench: 82.6% (accuracy; test split, English)

Video-MME: 65.1% / 71.6% (accuracy; without subtitles / with subtitles)

ScreenSpot: 84.7% (accuracy; GUI element localization, overall; agent benchmark)

AndroidWorld: 25.5% (success rate, AndroidWorld_SR; agent benchmark)

Note: the catalog files this entry under OSWorld, but the score is the AndroidWorld success rate (AndroidWorld_SR). The model also achieves 93.7% on Android Control Low_EM.

MVBench: 69.6% (accuracy; video reasoning, multiple choice)

MMVet: 67.1 (score; GPT-4-Turbo as judge)

InfoVQA: 82.6% (ANLS; test split)
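The DocVQA and InfoVQA results in this section are reported as ANLS (Average Normalized Levenshtein Similarity). As a minimal sketch of the commonly used definition, assuming the usual 0.5 threshold and case-insensitive comparison (official evaluation scripts may normalise text differently), the metric can be computed like this:

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """predictions: list of str; references: list of lists of accepted answers."""
    scores = []
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            # Answers at or beyond the distance threshold score zero.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

Because near-misses receive partial credit, ANLS is more forgiving of small OCR slips than exact-match accuracy, which is worth keeping in mind when comparing it with the accuracy-based scores in the same list.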

Technical architecture

Core Architecture

Transformer · ViT

Model Form

Multimodal LLM · LLM

Training Techniques

Instruction Tuning

Sources and related pages

4 sources

Docs: Qwen2.5-VL-7B-Instruct, HuggingFace model card (huggingface.co)

Blog: Qwen2.5-VL blog post, Qwen Team, January 2025 (qwenlm.github.io)

Paper: Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution, arXiv 2409.12191 (arxiv.org)

Paper: Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities, arXiv 2308.12966 (arxiv.org)
