Qwen3-8B

Family: Qwen3
Alibaba's Qwen3-8B is an 8.2B-parameter language model released under Apache 2.0. It offers a hybrid thinking/non-thinking mode, a 128K-token context window, support for 119 languages, and strong results in math, coding and agent tasks.
✓ Active · ✓ Public access · ⚖ Open source · LLM · Reasoning model · Tool-using model · 📁 Qwen3
Context window: 128K tokens
Parameters: 8.2B
Max output: 32,768 tokens
Release date: 29 April 2025
Access: Download, API, Hosted
Deployment: 💻 Local, ☁ Cloud, 📱 On-device

Overview

Qwen3-8B is a post-trained (instruct) language model from the Qwen3 family developed by the Qwen Team at Alibaba Group and released on 29 April 2025 under the Apache 2.0 licence. It is a dense model with 8.2 billion parameters (6.95B non-embedding), built on 36 Transformer layers with GQA (32 Q heads, 8 KV heads). The model is part of the Qwen3 series, which includes dense models from 0.6B to 32B and MoE models at 30B–235B.
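
To make the GQA figures concrete, the following sketch estimates how much key/value cache one token occupies. The layer and head counts come from the paragraph above; the head dimension of 128 and the 2-byte (BF16) cache entries are assumptions not stated on this page, and the function name is illustrative.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Each layer stores one key vector and one value vector per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


LAYERS = 36     # from the overview
Q_HEADS = 32    # from the overview
KV_HEADS = 8    # from the overview
HEAD_DIM = 128  # assumption: not stated on this page

gqa = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)
mha = kv_cache_bytes_per_token(LAYERS, Q_HEADS, HEAD_DIM)  # if every Q head had its own KV head

print(gqa)         # 147456 bytes (144 KiB) per token
print(mha // gqa)  # 4
```

Under these assumptions a 32,768-token sequence needs about 4.5 GiB of cache, a quarter of what 32 KV heads would require.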

Hybrid thinking mode

The key innovation of Qwen3 is support for two modes within a single model. In thinking mode (enable_thinking=True), the model generates step-by-step reasoning inside a <think>…</think> block, followed by the final answer. In non-thinking mode (enable_thinking=False), it gives a direct answer without explicit reasoning, similar to traditional chat models. The mode can be switched dynamically via /think and /no_think flags in the prompt or via the enable_thinking parameter of the chat template. Recommended sampling settings for thinking mode: Temperature=0.6, TopP=0.95, TopK=20. For non-thinking mode: Temperature=0.7, TopP=0.8, TopK=20.
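
The two modes can be handled on the client side with a small amount of glue code. The sketch below picks the recommended sampling settings for each mode and separates the <think>…</think> block from the final answer. The `SamplingParams` class and both helper functions are hypothetical names introduced for illustration; they are not part of any Qwen library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingParams:
    temperature: float
    top_p: float
    top_k: int


# Recommended settings quoted in the section above.
THINKING = SamplingParams(temperature=0.6, top_p=0.95, top_k=20)
NON_THINKING = SamplingParams(temperature=0.7, top_p=0.8, top_k=20)


def sampling_for(enable_thinking: bool) -> SamplingParams:
    """Return the recommended sampling settings for the chosen mode."""
    return THINKING if enable_thinking else NON_THINKING


def split_thinking(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, answer)."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No reasoning block, e.g. non-thinking mode.
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_thinking("<think>2 + 2 = 4</think>\n\nThe answer is 4.")
print(reasoning)  # 2 + 2 = 4
print(answer)     # The answer is 4.
```

Output without a closing </think> tag is treated as a plain answer, so the same helper works in both modes.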

Architecture and pretraining

The architecture is based on the Transformer with GQA, SwiGLU, RoPE and RMSNorm (similar to Qwen2.5), but removes the QKV bias and adds QK-Norm for training stability. The model natively supports a 32,768-token context, extendable to 131,072 tokens (128K) with YaRN. Pre-training covered approximately 36 trillion tokens in three stages: a general stage (about 30T tokens), a STEM- and code-heavy stage (about 5T tokens) and a long-context stage that raised the training sequence length to 32K tokens. The training corpus was expanded using Qwen2.5-VL to extract text from documents and Qwen2.5-Math and Qwen2.5-Coder to generate synthetic data.
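
QK-Norm is commonly implemented as a normalization of the query and key vectors of each attention head before their dot product is taken. The pure-Python sketch below illustrates that idea with RMSNorm and unit gains. It is a simplified illustration under those assumptions, not the model's actual implementation, and the function names are hypothetical.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """RMSNorm: rescale x to unit root-mean-square, then apply a per-element gain."""
    if weight is None:
        weight = [1.0] * len(x)  # unit gains stand in for learned parameters
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * scale for v, w in zip(x, weight)]


def qk_norm_logit(q, k):
    """Scaled dot-product attention logit with RMSNorm applied to q and k first."""
    q_n, k_n = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q_n, k_n)) / math.sqrt(len(q))


q, k = [3.0, 4.0], [1.0, 2.0]
small = qk_norm_logit(q, k)
large = qk_norm_logit([100 * v for v in q], k)  # same direction, 100x the magnitude
print(abs(small - large) < 1e-6)  # True
```

Because the logit no longer depends on the magnitude of q or k, it stays bounded (by the square root of the head dimension with unit gains), which is one way such normalization can help keep attention scores from growing during training.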

Post-training (4 stages)

Post-training of the flagship Qwen3 models comprised four stages: (1) Long-CoT cold start: fine-tuning on CoT data from mathematics, code and STEM. (2) Reasoning RL: reinforcement learning with rule-based rewards (GRPO). (3) Thinking Mode Fusion: integration of the non-thinking mode via SFT on mixed CoT and instruction data. (4) General RL: RL across 20+ general tasks (instruction following, format compliance, agent tasks). Smaller models, including the 8B, were instead trained via strong-to-weak distillation from the larger models rather than the full four-stage pipeline.
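
Stage 2 combines rule-based rewards with GRPO. The sketch below shows the group-relative advantage at the core of GRPO as it is commonly described: each sampled completion's reward is standardized against the other completions for the same prompt. The answer-matching rule, the "Answer:" marker, the use of the population standard deviation and the epsilon guard are illustrative assumptions, not details taken from the Qwen3 report.

```python
import statistics


def rule_based_reward(completion: str, reference: str) -> float:
    """Hypothetical rule: 1.0 if the text after the last 'Answer:' matches the reference."""
    _, marker, tail = completion.rpartition("Answer:")
    if not marker:
        return 0.0  # no answer marker found
    return 1.0 if tail.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


completions = [
    "6 * 7 = 42. Answer: 42",
    "6 * 7 = 48. Answer: 48",
    "I am not sure.",
    "Answer: 42",
]
rewards = [rule_based_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards)  # [1.0, 0.0, 0.0, 1.0]
```

Correct completions receive an advantage close to +1 and incorrect ones close to -1; when every completion in a group earns the same reward, all advantages are zero and the group contributes no learning signal.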

Multilingual support and agent capabilities

The model supports 119 languages and dialects (vs. 29 in Qwen2.5), covering Indo-European, Sino-Tibetan, Afro-Asiatic, Austronesian and other language families. In terms of agent capabilities, Qwen3-8B integrates well with external tools via MCP (Model Context Protocol), supporting function calls in both thinking and non-thinking modes. Qwen-Agent is the recommended agent framework.
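
As a rough illustration of function calling, the sketch below extracts tool calls from model output and dispatches them to Python functions. It assumes calls are emitted as JSON objects with `name` and `arguments` keys wrapped in <tool_call>…</tool_call> tags, a convention this page does not specify. The `run_tool_calls` helper and the `add` tool are hypothetical; in practice a framework such as Qwen-Agent would handle this step.

```python
import json
import re

# Assumed wire format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def run_tool_calls(output: str, tools: dict) -> list:
    """Parse every tool call in the model output and invoke the matching function."""
    results = []
    for raw in TOOL_CALL_RE.findall(output):
        call = json.loads(raw)
        function = tools[call["name"]]  # raises KeyError for unknown tools
        results.append(function(**call.get("arguments", {})))
    return results


def add(a: int, b: int) -> int:
    """Hypothetical example tool."""
    return a + b


demo_output = (
    "I will use the calculator.\n"
    '<tool_call>\n{"name": "add", "arguments": {"a": 2, "b": 3}}\n</tool_call>'
)
results = run_tool_calls(demo_output, {"add": add})
print(results)  # [5]
```

The tool results would then be appended to the conversation so the model can produce its final answer, in either thinking or non-thinking mode.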

Technical specification

Context window: 128K tokens
Parameters: 8.2B
Max output tokens: 32,768 tokens per response
Knowledge cutoff: 1 Apr 2025
License: Apache 2.0
Hardware requirements: GPU with at least ~16 GB VRAM (BF16). Flash Attention 2 recommended. Supported: Transformers (>=4.51.0), vLLM (>=0.8.5), SGLang (>=0.4.6.post1), llama.cpp (>=b5401), Ollama.
Features: Tool use, Fine-tuning
Modalities
⬇ Input: text
⬆ Output: text, code
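
The hardware requirement in the specification can be sanity-checked with simple arithmetic: memory for the weights is the parameter count times the bytes per parameter. The sketch below applies this to the 8.2B parameters listed above. The INT8 and INT4 rows are illustrative precisions that this page does not mention, and the function name is hypothetical.

```python
def weight_memory_gb(params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


PARAMS = 8.2e9  # parameter count from the specification

for label, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_memory_gb(PARAMS, bits):.1f} GB")
```

This prints 16.4 GB for BF16, 8.2 GB for INT8 and 4.1 GB for INT4. The estimate excludes the KV cache, activations and any quantization overhead, so a 16 GB card is a tight fit in BF16 and real usage will be somewhat higher.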

Capabilities and applications

Native model capabilities
Reasoning
The model's ability to reason logically and solve complex problems.
Category: reasoning
Multi-step reasoning
Carrying out multi-step chains of reasoning across long, complex tasks.
Category: reasoning
Mathematical reasoning
The model's ability to solve mathematical tasks requiring multi-step reasoning — equations, proofs, combinatorics, geometry, calculus and competition-level problems.
Category: reasoning
Coding
Generating, analysing and modifying source code.
Category: coding
Multilingual
Understanding and generating text in many languages.
Category: language
Long context
Maintaining coherence and focus across very long input context.
Category: language
Agentic capability
The model's ability to autonomously plan and execute multi-step tasks by sequentially using tools, maintaining context, and adapting to intermediate results.
Category: planning
Planning
Forming and executing action plans for complex tasks.
Category: planning
Function Calling
Category: planning
Structured output
Producing data in structured formats such as JSON.
Category: structured_generation
Language modeling
Ability to predict subsequent tokens and generate coherent natural-language text based on the preceding context.
Category: language

Benchmark results

12 benchmarks
MMLU
accuracy · 5-shot, base model Qwen3-8B-Base
76.89%
📅 14 May 2025 · 📄 Qwen3 Technical Report, Table 6 (arXiv 2505.09388)
Score for the base model (Qwen3-8B-Base). Instruction model results available in the technical report.
MMLU-Pro
accuracy · 5-shot CoT, base model Qwen3-8B-Base
56.73%
📅 14 May 2025 · 📄 Qwen3 Technical Report, Table 6 (arXiv 2505.09388)
GPQA
accuracy · 5-shot CoT, base model Qwen3-8B-Base
44.44%
📅 14 May 2025 · 📄 Qwen3 Technical Report, Table 6 (arXiv 2505.09388)
MATH
accuracy · 4-shot CoT, base model Qwen3-8B-Base
60.80%
📅 14 May 2025 · 📄 Qwen3 Technical Report, Table 6 (arXiv 2505.09388)
Score on the full MATH dataset, not the MATH-500 subset. MATH-500 in thinking mode may yield higher scores.
GSM8K
accuracy · 4-shot CoT, base model Qwen3-8B-Base
89.84%
📅 14 May 2025 · 📄 Qwen3 Technical Report, Table 6 (arXiv 2505.09388)
MGSM
accuracy · 8-shot CoT, multilingual math, base model Qwen3-8B-Base
76.02%
📅 14 May 2025 · 📄 Qwen3 Technical Report, Table 6 (arXiv 2505.09388)
SWE-bench
pass@1 · post-training, thinking mode
Not reported
📅 14 May 2025 · 📄 Qwen3 Technical Report (arXiv 2505.09388), see the instruct-model results
Specific 8B score not separately published in available sources.
LiveCodeBench
pass@1 · post-training, thinking mode
Not reported
📅 14 May 2025 · 📄 Qwen3 Technical Report (arXiv 2505.09388)
Flagship Qwen3-235B achieves 70.7. The 8B score is not separately published.
BFCL (Berkeley Function-Calling Leaderboard)
accuracy · post-training, function calling
Not reported
📅 14 May 2025 · 📄 Qwen3 Technical Report (arXiv 2505.09388)
Flagship Qwen3-235B achieves 70.8. The 8B score is not separately published.
IFEval
accuracy · post-training, non-thinking mode
Not reported
📅 14 May 2025 · 📄 Qwen3 Technical Report (arXiv 2505.09388)
The 8B score is not separately published in available sources.
AIME 2024
pass@1 · post-training, thinking mode
Not reported
📅 14 May 2025 · 📄 Qwen3 Technical Report (arXiv 2505.09388)
Flagship Qwen3-235B achieves 85.7. The 8B score is not separately published.
AIME 2025
pass@1 · post-training, thinking mode
Not reported
📅 14 May 2025 · 📄 Qwen3 Technical Report (arXiv 2505.09388)
Flagship Qwen3-235B achieves 81.5. The 8B score is not separately published.

Technical architecture

Core Architecture
Model Form