Reasoning

Reasoning model

2024 · Active · Published: 20 March 2026 · Updated: 20 March 2026

Key innovation

Training a language model with reinforcement learning to generate an extended chain of thought before producing an answer, enabling performance scaling through increased test-time compute independently of model size.

How it works

A reasoning model uses a more deliberative inference mode, in which the model allocates additional tokens or computational steps to think through the task. This can involve decomposing the problem into stages, comparing multiple solution paths, checking for consistency, and only then generating the final answer.

Problem solved

Standard generative models often respond too quickly to difficult questions, increasing the risk of logical errors, skipped steps, and shallow reasoning. A reasoning model is designed to improve response quality on tasks that require deeper analysis.

Components

LLM backbone (pretrained Transformer)

Generates tokens (both reasoning tokens and final answer tokens) via autoregressive prediction.

Pretrained decoder-only language model (Transformer) forming the base of the reasoning model. The architecture is identical to that of standard LLMs; a reasoning model differs from a standard LLM in its post-training, not in its architecture.

Official

Chain of Thought Reasoning

Extended intermediate processing that enables multiple passes over a problem before generating a final answer.

Sequence of tokens generated by the model before the final answer, containing reasoning steps, problem decomposition, self-verification, and corrections. Forms the model's working scratchpad and is the key mechanism for test-time scaling.

Reward model

Supplies the learning signal to the RL algorithm that drives the development of reasoning capabilities.

Component that evaluates output quality during RL training. It may be an outcome reward model (ORM, which evaluates only the correctness of the final answer) or a process reward model (PRM, which evaluates individual reasoning steps). The reward signal drives learning of the CoT generation policy.

Outcome Reward Model (ORM)

Evaluates only the correctness of the final answer, e.g., via mathematical verification or code execution. DeepSeek-R1-Zero uses rule-based outcome rewards of this kind rather than a learned reward model.

Process Reward Model (PRM)

Evaluates the quality of individual reasoning steps. Described in Lightman et al. (2023) 'Let's Verify Step by Step'.
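
The difference between the two reward styles can be illustrated with a minimal sketch. The function names are hypothetical, the exact string match stands in for a real answer verifier, and taking the product of per-step scores is only one possible aggregation (it treats the solution score as the probability that every step is correct).

```python
from math import prod


def outcome_reward(final_answer: str, reference: str) -> float:
    """ORM-style signal: look only at the final answer.

    Assumption: exact string match after trimming whitespace; a real
    verifier would normalize math expressions or execute code instead.
    """
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_reward(step_scores: list[float]) -> float:
    """PRM-style signal: aggregate per-step correctness probabilities.

    Assumption: step_scores come from some step-level scorer (not shown),
    each in [0, 1]. An empty chain gets no credit.
    """
    return prod(step_scores) if step_scores else 0.0
```

With this sketch, a solution that reaches the right answer through one doubtful step still receives the full outcome reward, while its process reward is pulled down by the weak step.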

Official

Reinforcement Learning (RL) Training Algorithm

Trains the model to generate CoT reasoning that leads to correct answers on verifiable tasks.

Algorithm optimizing the model's chain-of-thought generation policy based on reward signals. DeepSeek-R1 uses GRPO (Group Relative Policy Optimization). The specific RL algorithm used in OpenAI o1 has not been published.

GRPO (Group Relative Policy Optimization)

RL algorithm used in DeepSeek-R1. Estimates the baseline from the scores of a group of outputs sampled for the same prompt instead of using a separate critic model. Described in Shao et al. (2024).
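
The group-based baseline can be sketched in a few lines. This covers only the advantage computation, not the full GRPO objective (clipping and the KL term are omitted). The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the other samples for the same prompt.

    The group mean plays the role of the baseline that a critic model
    would otherwise estimate. eps (an assumption) avoids division by
    zero when every sample in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored 1.0 (correct) or 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that score above the group average get a positive advantage and those below get a negative one; if all samples score the same, every advantage is zero and the group contributes no learning signal.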

Official

Implementation

Reference implementations

DeepSeek-R1 – open-source reasoning model (DeepSeek-AI)

Python · DeepSeek-AI

Official

DeepSeek-R1 – Hugging Face Hub

Python · DeepSeek-AI

Official

Implementation pitfalls

Unstable and poorly readable CoT with pure RL and no cold-start data · High

As shown by DeepSeek-R1-Zero, training via pure RL without SFT leads to emergent but poorly formatted reasoning chains: language mixing, endless repetition, poor readability. DeepSeek-R1 addresses this via cold-start data (SFT on a small set of exemplary CoT data before RL).

Fix: Use cold-start data (SFT on curated CoT examples) before the RL phase to establish a baseline reasoning format. Explicitly define the CoT format (e.g., via the <think>...</think> template).
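
A format check for the template can be sketched as follows. It assumes a completion must consist of exactly one <think>...</think> block followed by a non-empty answer; the function names and the binary reward value are hypothetical, and real training setups may enforce a different template.

```python
import re

# One reasoning block at the start, then the final answer.
THINK_PATTERN = re.compile(r"\A\s*<think>(.*?)</think>\s*(.+)\Z", re.DOTALL)


def split_reasoning(completion: str):
    """Return (reasoning, answer) if the completion follows the template, else None."""
    match = THINK_PATTERN.match(completion)
    if match is None:
        return None
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
    # Reject empty answers and stray tags leaking into the answer part.
    if not answer or "<think>" in answer or "</think>" in answer:
        return None
    return reasoning, answer


def format_reward(completion: str) -> float:
    """Hypothetical binary reward for following the template."""
    return 1.0 if split_reasoning(completion) is not None else 0.0
```

A check like this can be used both to filter cold-start examples and as a format component of the reward during RL.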

Reward hacking – model exploits shortcuts in the reward system · High

With insufficiently defined reward functions, the model may find ways to obtain high rewards without actually solving the problem (reward hacking). OpenAI noted this in the o1 system card.

Fix: Use precise, verifiable reward functions (e.g., formal mathematical verification, code execution with unit tests). Avoid rewards based solely on CoT length or other easily gamed metrics.
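
As an illustration of a verifiable reward, the sketch below scores generated code by running it against unit tests. The function name and the all-or-nothing scoring are assumptions. It uses `exec` without any sandbox or timeout, so it is suitable only as a toy; a real pipeline would need an isolated process with resource limits.

```python
def unit_test_reward(candidate_source: str, tests: list[str]) -> float:
    """Return 1.0 only if the candidate code runs and every test passes.

    WARNING (assumption for brevity): exec runs untrusted code in-process,
    with no sandbox and no protection against infinite loops.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)   # define the candidate's functions
        for test in tests:
            exec(test, namespace)           # each test is an assert statement
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all score zero.
        return 0.0
    return 1.0


tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
```

Here `unit_test_reward(good, tests)` earns the reward and `unit_test_reward(bad, tests)` does not. Because the signal depends on observable behavior rather than on how the answer looks, it is harder to game than a length-based or style-based metric, although weak test suites can still be exploited.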

Overthinking – unnecessary CoT elongation for simple queries · Medium

Reasoning models may generate unnecessarily long chains of thought for simple tasks, increasing inference cost without improving answer quality. The 'overthinking' phenomenon has been described in 2025 research literature as a significant efficiency challenge.

Fix: Use configurable thinking budgets (thinking budget or reasoning effort settings). Route complex queries to reasoning models and simple ones to standard LLMs.
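
The routing idea can be sketched with a simple heuristic. Everything here is hypothetical: the model labels, the keyword list, the length threshold, and the token budget are placeholders, and a production router would more likely use a trained difficulty classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str                   # hypothetical model label
    thinking_budget_tokens: int  # 0 means no extended reasoning


# Placeholder signals that a query may need multi-step reasoning.
HARD_HINTS = ("prove", "derive", "debug", "optimize", "step by step")


def route_query(query: str) -> Route:
    """Send queries that look hard to a reasoning model, the rest to a standard LLM."""
    text = query.lower()
    looks_hard = len(text.split()) > 40 or any(hint in text for hint in HARD_HINTS)
    if looks_hard:
        return Route(model="reasoning-model", thinking_budget_tokens=8192)
    return Route(model="standard-llm", thinking_budget_tokens=0)
```

The point of the design is that inference cost scales with the estimated difficulty of the query instead of being paid uniformly on every request.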

CoT unfaithfulness – reasoning trace may not reflect the model's actual computation · Medium

The visible chain of thought in a reasoning model is not guaranteed to correspond to the model's actual internal computation. The CoT may be a post-hoc rationalization, which complicates debugging and safety evaluation.

Fix: Apply CoT monitoring as described in the OpenAI o1 system card. Assess CoT faithfulness through perturbations and ablations. Account for CoT interpretability limitations when deploying in safety-critical systems.
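
One simple perturbation is to truncate the reasoning trace and see whether the final answer changes. The sketch below assumes a hypothetical `answer_fn` callable that wraps the model and returns an answer for a question given a forced CoT prefix; the metric name and its interpretation are illustrative, not an established standard.

```python
from typing import Callable


def truncation_sensitivity(
    question: str,
    cot_steps: list[str],
    answer_fn: Callable[[str, list[str]], str],
) -> float:
    """Fraction of truncated CoT prefixes whose answer differs from the full-CoT answer.

    answer_fn is a hypothetical wrapper around the model: it receives the
    question and a (possibly truncated) list of reasoning steps and
    returns the final answer produced from that prefix.
    """
    if not cot_steps:
        return 0.0
    full_answer = answer_fn(question, cot_steps)
    changed = sum(
        answer_fn(question, cot_steps[:k]) != full_answer
        for k in range(len(cot_steps))  # prefixes of length 0 .. n-1
    )
    return changed / len(cot_steps)
```

A value near 0 means the answer barely depends on the visible steps, which is consistent with (but does not prove) post-hoc rationalization; a value near 1 means the answer changes whenever the trace is cut short.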

Evolution

2022

Wei et al. (Google Brain) formalize Chain-of-Thought prompting

Inflection point

Wei et al. published 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models', showing that prompting LLMs to generate intermediate steps significantly improves performance on arithmetic, commonsense, and symbolic reasoning tasks.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (paper)

2023

Lightman et al. (OpenAI) demonstrate effectiveness of Process Reward Models

Inflection point

The paper 'Let's Verify Step by Step' showed that supervising each reasoning step (process supervision, used to train a PRM) 'significantly outperforms outcome supervision' on challenging math problems.

Let's Verify Step by Step (paper)

2024

OpenAI introduces the "reasoning model" term and category with the o1 release (September 2024)

Inflection point

OpenAI released o1-preview and o1-mini on September 12, 2024 as the first publicly available 'reasoning model' series. The models were trained via large-scale RL to use CoT. The term 'reasoning model' entered widespread use as a category name.

Learning to Reason with LLMs (OpenAI Blog) (paper)

2025

DeepSeek-R1 – first open, fully documented reasoning model (January 2025)

Inflection point

DeepSeek-AI published arXiv:2501.12948, the first open, comprehensive technical documentation of reasoning model training with RL (GRPO). DeepSeek-R1-Zero showed that reasoning capabilities can emerge via pure RL without supervised fine-tuning, while DeepSeek-R1 adds cold-start SFT before the RL phase to improve readability. The open-source models were released publicly.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (paper)