A reasoning model uses a more deliberative inference mode, in which the model allocates additional tokens or computational steps to think through the task. This can involve decomposing the problem into stages, comparing multiple solution paths, checking for consistency, and only then generating the final answer.
Standard generative models often respond too quickly to difficult questions, increasing the risk of logical errors, skipped steps, and shallow reasoning. A reasoning model is designed to improve response quality on tasks that require deeper analysis.
Pretrained decoder-only language model (Transformer) forming the base of the reasoning model. The architecture is the same as in standard LLMs; a reasoning model differs from a standard LLM in its post-training, not in its architecture.
Sequence of tokens generated by the model before the final answer, containing reasoning steps, problem decomposition, self-verification, and corrections. Forms the model's working scratchpad and is the key mechanism for test-time scaling.
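A minimal sketch of how a caller might separate the scratchpad from the final answer. It assumes the trace is wrapped in `<think>...</think>` delimiters; other models use different markers or hide the trace entirely, and `split_trace` is a hypothetical helper name.

```python
import re

# Assumption: the reasoning trace is wrapped in <think>...</think>.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_trace(output: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) from raw model output."""
    match = THINK_PATTERN.search(output)
    if match is None:
        # No visible trace: treat the whole output as the answer.
        return "", output.strip()
    trace = match.group(1).strip()
    answer = output[match.end():].strip()
    return trace, answer

raw = "<think>17 * 3 = 51. Check: 51 / 3 = 17.</think>The answer is 51."
trace, answer = split_trace(raw)
print("trace :", trace)
print("answer:", answer)
```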
Component evaluating output quality during RL training. May be an outcome reward model (ORM, evaluating only the final answer correctness) or a process reward model (PRM, evaluating individual reasoning steps). The reward signal drives CoT generation policy learning.
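The difference between the two reward types can be illustrated with a toy example. This is a sketch under stated assumptions: `toy_step_scorer` is a hypothetical stand-in for a learned step scorer, and aggregating step scores with `min` is one possible choice, not a prescribed one.

```python
from typing import Callable, Sequence

def outcome_reward(final_answer: str, reference: str) -> float:
    """ORM-style signal: looks only at the final answer."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def process_reward(steps: Sequence[str],
                   step_scorer: Callable[[str], float]) -> float:
    """PRM-style signal: scores every step; min() lets one bad step dominate."""
    if not steps:
        return 0.0
    return min(step_scorer(step) for step in steps)

def toy_step_scorer(step: str) -> float:
    """Hypothetical scorer: checks steps of the form 'a * b = c'."""
    lhs, rhs = step.split("=")
    a, b = lhs.split("*")
    return 1.0 if int(a) * int(b) == int(rhs) else 0.0

# A trace with a faulty second step that still ends in the right answer.
steps = ["2 * 3 = 6", "6 * 4 = 25"]
orm_score = outcome_reward("24", "24")
prm_score = process_reward(steps, toy_step_scorer)
print(orm_score, prm_score)
```

The outcome signal rewards this trace in full, while the process signal penalizes the faulty step.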
Algorithm optimizing the model's chain-of-thought generation policy based on reward signals. DeepSeek-R1 uses GRPO (Group Relative Policy Optimization). The specific RL algorithm used in OpenAI o1 has not been published.
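The core idea behind GRPO's name is that each sampled response is scored relative to the other responses in its group for the same prompt, so no separate value network is needed. The sketch below shows only that advantage normalization, not the full policy update. The `eps` term and the use of the population standard deviation are assumptions made for this illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps (assumption) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one prompt, scored 1.0 (correct) or 0.0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average receive positive advantages and the rest receive negative ones. When every rollout earns the same reward, all advantages are zero and the group contributes no learning signal.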
As shown by DeepSeek-R1-Zero, training via pure RL without SFT leads to emergent but poorly formatted reasoning chains: language mixing, endless repetition, poor readability. DeepSeek-R1 addresses this via cold-start data (SFT on a small set of exemplary CoT data before RL).
When the reward function is underspecified, the model may find ways to obtain high rewards without actually solving the problem (reward hacking). OpenAI noted this in the o1 system card.
Reasoning models may generate unnecessarily long chains of thought for simple tasks, increasing inference cost without improving answer quality. The 'overthinking' phenomenon has been described in 2025 research literature as a significant efficiency challenge.
The visible reasoning trace is not guaranteed to reflect the model's actual internal computation. The CoT may be a 'post-hoc rationalization', which complicates debugging and safety evaluation.
Wei et al. (2022) published 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models', showing that prompting LLMs to generate intermediate steps significantly improves performance on arithmetic, commonsense, and symbolic reasoning tasks.
The paper 'Let's Verify Step by Step' (Lightman et al., 2023) showed that process supervision, which gives feedback on each reasoning step and is used to train a PRM, 'significantly outperforms outcome supervision' on challenging math problems.
OpenAI released o1-preview and o1-mini on September 12, 2024 as the first publicly available 'reasoning model' series. The models were trained via large-scale RL to reason with CoT before answering. The term 'reasoning model' entered widespread use as a category name.
DeepSeek-AI published the DeepSeek-R1 paper (arXiv:2501.12948), the first open, comprehensive technical documentation of reasoning model training using RL (GRPO). DeepSeek-R1-Zero showed that reasoning capabilities can emerge via pure RL without supervised fine-tuning; DeepSeek-R1 itself adds a cold-start SFT stage before RL. The model weights were released publicly.
Reasoning models generate significantly longer token sequences than standard LLMs due to extended CoT before the answer. Billed per-query cost grows linearly with the number of CoT tokens; attention compute grows faster than linearly, because each newly generated token attends to all preceding tokens. For complex tasks, reasoning traces can span thousands of tokens, multiplying per-query cost relative to a standard LLM.
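A back-of-the-envelope sketch of the billed-cost effect. The prices and token counts are hypothetical, and the function assumes reasoning tokens are billed at the output-token rate, which varies by provider.

```python
def query_cost(prompt_tokens: int, reasoning_tokens: int, answer_tokens: int,
               price_in_per_m: float, price_out_per_m: float) -> float:
    """Estimated cost of one query, in the currency of the given prices."""
    # Assumption: CoT tokens are billed like ordinary output tokens.
    output_tokens = reasoning_tokens + answer_tokens
    total = prompt_tokens * price_in_per_m + output_tokens * price_out_per_m
    return total / 1_000_000

PRICE_IN, PRICE_OUT = 1.0, 4.0  # hypothetical prices per million tokens

standard = query_cost(500, 0, 300, PRICE_IN, PRICE_OUT)
reasoning = query_cost(500, 4000, 300, PRICE_IN, PRICE_OUT)
print(f"standard: {standard:.4f}  reasoning: {reasoning:.4f}  "
      f"ratio: {reasoning / standard:.1f}x")
```

With these illustrative numbers, a 4,000-token trace makes the same query roughly ten times more expensive, even though the prompt and the visible answer are unchanged.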
Limit or setting controlling the maximum number of CoT tokens generated before the final answer. Directly controls the quality/inference-cost trade-off.
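One way such a budget could be enforced at decode time is to cut the trace once the limit is reached and force the transition to the answer phase. This is a hypothetical sketch: the `</think>` delimiter and the `cap_reasoning` helper are assumptions, and real serving stacks may implement budgets differently.

```python
from typing import Iterable, Iterator

END_OF_THINKING = "</think>"  # assumption: marker that closes the trace

def cap_reasoning(token_stream: Iterable[str], budget: int) -> Iterator[str]:
    """Yield reasoning tokens until the trace closes or the budget is spent."""
    emitted = 0
    for token in token_stream:
        if token == END_OF_THINKING or emitted >= budget:
            break
        yield token
        emitted += 1
    # Always close the trace so decoding can continue with the answer.
    yield END_OF_THINKING

long_trace = ["step1", "step2", "step3", "step4"]
capped = list(cap_reasoning(long_trace, budget=2))
print(capped)
```

A smaller budget lowers cost and latency but may cut the trace before the model has verified its answer, which is the trade-off this parameter controls.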
Amount of compute dedicated to RL training (number of RL steps, rollout data size). OpenAI reports that o1 performance consistently improves with more RL training compute.
Choice between outcome reward model (ORM) and process reward model (PRM). Affects CoT quality, interpretability, and training cost.
The model processes both reasoning tokens and answer tokens through the same decoder layers; there is no separate reasoning module. What differs between the two stages is their length: the CoT generation phase (reasoning stage) can run many times longer than the final answer generation phase (answer stage).
RL training can be parallelized by processing multiple rollouts simultaneously. Inference for different queries is independent and can be handled in parallel by multiple model instances.
Reasoning models use the same Transformer decoder architecture as standard LLMs and, in practice, rely on GPUs with matrix-multiply accelerators such as Tensor Cores for efficient inference. Generating long CoT chains substantially increases VRAM demand (KV cache for long sequences) and GPU time per query.
TPU v4/v5 are used to train large reasoning models (e.g., by Google). They efficiently handle long token sequences via high-bandwidth memory (HBM) and a GEMM-optimized architecture.
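The KV cache growth can be estimated with simple arithmetic: each layer stores one key and one value vector per KV head for every token in the sequence. The model configuration below is hypothetical and chosen only to make the numbers easy to follow.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size; bytes_per_value=2 corresponds to fp16/bf16."""
    # Leading factor 2: one key tensor plus one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration: 32 layers, 8 KV heads, head dimension 128.
for seq_len in (1_024, 8_192, 32_768):
    size = kv_cache_bytes(32, 8, 128, seq_len)
    print(f"{seq_len:>6} tokens -> {size / 1024**3:.3f} GiB")
```

Because the cache grows linearly with sequence length, a trace that is 32 times longer needs 32 times the cache memory per sequence, which limits how many queries fit on one accelerator at once.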