Reasoning

CoT

2022 · Active · Published: 12 March 2026 · Updated: 29 May 2026

Key innovation

Demonstrating that prompting large language models to generate a series of intermediate natural-language reasoning steps before producing a final answer significantly improves performance on complex multi-step tasks, and that this capability appears only as an emergent property of sufficiently large models.

How it works

1. Few-shot CoT: 4–8 exemplars are inserted into the prompt, each containing a full reasoning chain ending with a final answer (e.g. "Anna has 5 apples, gets 3 more, so 5+3=8. The answer is 8"). Conditioned on this pattern, the model produces an analogous chain for the new question.
2. Zero-shot CoT: a trigger phrase ("Let's think step by step") is appended to the question, and the model produces the chain and the answer in a single pass.
3. Decoding: standard greedy decoding of one chain, or Self-Consistency, which samples 10–40 independent chains with temperature > 0 and selects the most frequent final answer by majority vote.
4. Extraction: the final answer is parsed from the model output after a marker like "The answer is" or as the final sentence of the chain.
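A minimal sketch of how the two prompt styles in steps 1 and 2 might be assembled. The `Q:`/`A:` layout, the `Exemplar` class, and both function names are illustrative assumptions rather than a prescribed format, and the question about pens is invented for the example. No model call is made; the code only builds the prompt strings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One worked example: a question, its reasoning chain, and the final answer."""
    question: str
    chain: str
    answer: str


def build_few_shot_prompt(exemplars: List[Exemplar], question: str) -> str:
    """Few-shot CoT: show worked chains, then leave the new answer open."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.chain} The answer is {ex.answer}."
        for ex in exemplars
    ]
    blocks.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(blocks)


def build_zero_shot_prompt(question: str,
                           trigger: str = "Let's think step by step.") -> str:
    """Zero-shot CoT: no exemplars, only the trigger phrase after the question."""
    return f"Q: {question}\nA: {trigger}"


exemplars = [
    Exemplar(
        question="Anna has 5 apples and gets 3 more. How many apples does she have?",
        chain="Anna has 5 apples, gets 3 more, so 5+3=8.",
        answer="8",
    )
]
prompt = build_few_shot_prompt(
    exemplars, "Tom has 7 pens and loses 2. How many pens are left?"
)
print(prompt)
```

In practice the exemplar list would hold the 4–8 examples mentioned above; a single one is used here to keep the output short.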

Problem solved

Standard few-shot prompting fails on multi-step tasks — models produce immediate, incorrect answers because they try to solve a complex problem in one pass. Without explicit decomposition, models cannot reliably perform arithmetic, commonsense reasoning, or symbolic manipulations that require multiple dependent steps.

Components

Prompt with CoT Examples

Conditioning the model to generate reasoning steps before producing the final answer.

In few-shot CoT, the prompt contains a small number (typically 4–8) of exemplar problems whose answers are preceded by a chain of intermediate reasoning steps. In zero-shot CoT, a trigger phrase (e.g., 'Let's think step by step') is appended instead.

Few-shot CoT: Human-annotated exemplars with reasoning chains included in the prompt.

Zero-shot CoT: Trigger phrase appended to the question without exemplars (Kojima et al., 2022).

Auto-CoT: Automatically generated exemplars via clustering and zero-shot generation (Zhang et al., 2022).


Chain of Thought

Decomposition of a complex problem into verifiable intermediate steps.

The reasoning chain is the core output artifact of CoT. It consists of natural-language sentences that articulate sub-problems, intermediate computations, or logical deductions. It appears between the question and the final answer in the model output.

Output: natural-language text consisting of a sequence of reasoning steps that terminates in a final answer token or phrase (e.g., 'The answer is X').

Final Answer Extraction

Parses the final answer from model output that contains a chain-of-thought reasoning trace.

After the model generates its reasoning chain, the final answer is extracted from the output: by taking the last sentence of a single greedily decoded chain, by matching a pattern (e.g., 'The answer is'), or, with self-consistency, by majority vote over the answers parsed from multiple sampled chains.
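The pattern-matching and last-sentence strategies could be sketched roughly as follows. The function name `extract_answer`, the default marker string, and the sentence-splitting rule are assumptions made for illustration; real outputs vary in format and usually need task-specific normalization.

```python
import re
from typing import Optional


def extract_answer(output: str, marker: str = "the answer is") -> Optional[str]:
    """Return the text after the last marker, else the last sentence, else None."""
    hits = list(re.finditer(re.escape(marker), output, flags=re.IGNORECASE))
    if hits:
        # Use the LAST occurrence so a self-correction overrides an earlier answer.
        tail = output[hits[-1].end():].strip()
        candidate = tail.splitlines()[0] if tail else ""
    else:
        # Fallback: split on sentence-ending punctuation and take the last piece.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", output.strip()) if s]
        candidate = sentences[-1] if sentences else ""
    candidate = candidate.strip().rstrip(".").strip()
    return candidate or None


print(extract_answer("Anna has 5 apples, gets 3 more, so 5+3=8. The answer is 8."))
print(extract_answer("First add the apples. So she has 8 apples."))
```

Taking the last marker rather than the first is a design choice: chains sometimes restate or revise an answer, and the final statement is the one the model commits to.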

Greedy decoding (single chain): Single greedy decoding run; final answer parsed from model output.

Majority voting (self-consistency): Multiple chains sampled; most frequent final answer selected (Wang et al., 2022).
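A small sketch of the majority-voting variant, under the assumption that sampling and answer parsing are supplied as callables. `self_consistency`, `majority_vote`, and the canned sampler are hypothetical names; a real sampler would call a model with temperature > 0.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List, Optional


def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Most frequent non-empty answer; ties go to the answer seen first."""
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def self_consistency(sample_chain: Callable[[], str],
                     extract: Callable[[str], Optional[str]],
                     n_samples: int = 10) -> Optional[str]:
    """Sample n chains, parse each final answer, and return the majority answer."""
    answers = [extract(sample_chain()) for _ in range(n_samples)]
    return majority_vote(answers)


# Stand-in for a stochastic model: cycles through canned chains (two right, one wrong).
canned = cycle([
    "5+3=8. The answer is 8",
    "3 more than 5 is 8. The answer is 8",
    "5+4=9. The answer is 9",
])


def parse(text: str) -> str:
    # Toy parser: everything after the marker.
    return text.rsplit("The answer is", 1)[-1].strip()


result = self_consistency(lambda: next(canned), parse, n_samples=6)
print(result)
```

Only the final answers are compared, not the chains themselves, so two different reasoning paths that reach the same answer count as two votes for it.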


Implementation

Reference implementations

LangChain – chain-of-thought prompting

Python · LangChain

DSPy – ChainOfThought module

Python · Stanford NLP

PromptBench – CoT reasoning evaluation

Python · Microsoft Research

Auto-CoT – automatic chain-of-thought prompting

Python · Amazon Science (Zhang et al.)

Official

Implementation pitfalls

Unfaithful chains of reasoning · High

A model may produce a plausible-looking reasoning chain that does not actually causally determine its final answer — the reasoning post-hoc rationalizes a decision made by other internal mechanisms. The chain may be misleading rather than explanatory.

Fix: Do not treat CoT outputs as reliable explanations. Verify final answers independently. Apply process reward models when faithful reasoning is required.

Scale dependency: small models show degraded performance · High

For small base models without CoT-specific fine-tuning (below roughly 100B parameters at the time of the originating paper), CoT prompting may hurt performance, because such models tend to generate fluent but incorrect intermediate steps.

Fix: Use sufficiently large models, or, when working with smaller parameter counts, models fine-tuned on CoT data.

Sensitivity to example quality and selection · Medium

The choice of few-shot exemplars significantly affects CoT performance. Poorly constructed, ambiguous, or domain-mismatched exemplars can degrade reasoning quality.

Fix: Carefully select examples; apply active selection methods (Active-Prompt) or automatic chain generation to identify the most informative examples for the target task.
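One way active selection might look is a disagreement score over several sampled answers per question, in the spirit of uncertainty-based selection. This is a rough sketch under assumptions: the function names, the sample data, and the exact scoring rule are illustrative and are not taken from any specific implementation.

```python
from typing import Dict, List


def disagreement(answers: List[str]) -> float:
    """Share of distinct answers among the samples; higher means less certain."""
    return len(set(answers)) / len(answers)


def select_for_annotation(sampled: Dict[str, List[str]], n: int) -> List[str]:
    """Pick the n questions whose sampled answers disagree the most."""
    ranked = sorted(sampled, key=lambda q: disagreement(sampled[q]), reverse=True)
    return ranked[:n]


# Hypothetical answers from four sampled chains per candidate question.
sampled = {
    "q1": ["8", "8", "8", "8"],   # model is consistent
    "q2": ["3", "5", "3", "7"],   # model is unsure
    "q3": ["2", "4", "2", "2"],
}
chosen = select_for_annotation(sampled, n=2)
print(chosen)
```

The questions selected this way are the ones where a hand-written reasoning chain is likely to be most informative as an exemplar.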

Increased inference cost · Medium

Generating reasoning chains increases output token count, proportionally increasing latency and API cost relative to direct-answer prompting.

Fix: Use CoT selectively for tasks where it demonstrably improves accuracy; for simple tasks, direct prompting may suffice at lower cost.
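A back-of-the-envelope comparison can make the cost difference concrete. All token counts and prices below are hypothetical placeholders, and the sketch assumes each sampled chain is billed as a separate request; actual billing rules differ between providers.

```python
def estimated_cost(prompt_tokens: int, output_tokens: int,
                   price_in_per_1k: float, price_out_per_1k: float,
                   n_samples: int = 1) -> float:
    """Cost of n independent requests, with prices given per 1,000 tokens."""
    per_request = (prompt_tokens * price_in_per_1k
                   + output_tokens * price_out_per_1k) / 1000
    return n_samples * per_request


# Hypothetical numbers: 200-token prompt, prices of 1.0 (input) and 2.0 (output).
direct = estimated_cost(200, 5, 1.0, 2.0)            # short direct answer
cot = estimated_cost(200, 120, 1.0, 2.0)             # answer preceded by a chain
cot_sc = estimated_cost(200, 120, 1.0, 2.0, n_samples=10)  # 10 sampled chains

print(f"direct={direct:.2f} cot={cot:.2f} cot+self-consistency={cot_sc:.2f}")
```

With these placeholder values a single chain roughly doubles the cost of a direct answer, and sampling ten chains multiplies it by ten again, which is why selective use matters.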

Error accumulation across reasoning steps · High

An error in an early intermediate step propagates to all subsequent steps, often yielding a confidently stated but incorrect final answer.

Fix: Use self-consistency (majority voting over multiple sampled chains) to reduce the impact of single-chain errors; apply verification steps or external tool calls to check intermediate computations.
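As a very simple stand-in for an external verification step, the sketch below recomputes binary arithmetic statements of the form `a op b = c` found in a chain and reports the ones that do not hold. The function name and the regular expression are assumptions for illustration; the check ignores longer expressions, units, and anything not written as plain arithmetic.

```python
import operator
import re
from typing import List

_NUM = r"-?\d+(?:\.\d+)?"
_STEP = re.compile(rf"({_NUM})\s*([+\-*/])\s*({_NUM})\s*=\s*({_NUM})")
_OPS = {"+": operator.add, "-": operator.sub,
        "*": operator.mul, "/": operator.truediv}


def find_arithmetic_errors(chain: str, tol: float = 1e-9) -> List[str]:
    """Return every 'a op b = c' statement in the chain whose result is wrong."""
    errors = []
    for m in _STEP.finditer(chain):
        # Skip matches that are the tail of a longer expression such as 2+3+4=9.
        prefix = chain[:m.start()].rstrip()
        if prefix and prefix[-1] in "+-*/":
            continue
        a, op, b, claimed = m.groups()
        if op == "/" and float(b) == 0:
            errors.append(m.group(0))  # division by zero can never be correct
            continue
        if abs(_OPS[op](float(a), float(b)) - float(claimed)) > tol:
            errors.append(m.group(0))
    return errors


good = "Anna has 5 apples, gets 3 more, so 5+3=8. The answer is 8."
bad = "She buys 4 packs of 6, so 4*6=26. Then 26+2=28. The answer is 28."
print(find_arithmetic_errors(good))
print(find_arithmetic_errors(bad))
```

In the second chain the later step is internally correct but inherits the wrong value from the first, which is the accumulation effect described above; flagging the earliest failing step points at where the chain went wrong.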

Evolution

Original paper · 2022 · NeurIPS 2022 · Jason Wei

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou

2022

Few-shot Chain-of-Thought Prompting (Wei et al.)

Inflection point

Wei et al. demonstrate that few-shot prompting with reasoning-chain exemplars significantly improves LLM performance on arithmetic, commonsense, and symbolic reasoning. Establishes CoT as an emergent capability of large-scale models.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (paper)

2022

Zero-shot Chain-of-Thought (Kojima et al.)

Inflection point

Kojima et al. show that appending 'Let's think step by step' to a prompt elicits reasoning chains without any exemplars, making CoT applicable without manual annotation.

Large Language Models are Zero-Shot Reasoners (paper)

2022

Self-consistency decoding for CoT (Wang et al.)

Inflection point

Wang et al. propose sampling multiple diverse reasoning paths and selecting the most consistent final answer by majority vote, substantially improving CoT accuracy over greedy decoding.

Self-Consistency Improves Chain of Thought Reasoning in Language Models (paper)

2023

Tree of Thoughts (Yao et al.)

Yao et al. generalize CoT from linear chains to tree-structured search over intermediate thoughts, enabling backtracking and look-ahead in multi-step problem solving.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models (paper)

2024

Native reasoning models internalize CoT via RL (OpenAI o1)

Inflection point

OpenAI releases o1, a model trained with large-scale reinforcement learning to produce extended internal reasoning chains before answering, rather than relying on CoT prompting. This marks a shift from prompting-elicited to trained-in reasoning.

2025

Open reasoning models released (DeepSeek-R1)

DeepSeek releases R1, an open-source model trained with group relative policy optimization (GRPO) to produce long reasoning chains natively, achieving performance comparable to o1 on reasoning benchmarks.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (paper)