1. Few-shot CoT: 4–8 exemplars are inserted into the prompt, each containing a full reasoning chain ending with a final answer (e.g. "Anna has 5 apples, gets 3 more, so 5+3=8. The answer is 8"). Conditioned on this pattern, the model produces an analogous chain for the new question.
2. Zero-shot CoT: a trigger phrase ("Let's think step by step") is appended to the question, and the model produces the chain and the answer in a single pass.
3. Decoding: standard greedy decoding of one chain, or Self-Consistency (sample 10–40 independent chains with temperature > 0 and select the most frequent final answer by majority vote).
4. Extraction: the final answer is parsed from the model output after a marker like "The answer is" or as the final sentence of the chain.
Standard few-shot prompting often fails on multi-step tasks: models produce an immediate and frequently incorrect answer because they attempt to solve a complex problem in a single step. Without explicit decomposition, models struggle to reliably perform arithmetic, commonsense reasoning, or symbolic manipulation that requires multiple dependent steps.
In few-shot CoT, the prompt contains a small number (typically 4–8) of exemplar problems whose answers are preceded by a chain of intermediate reasoning steps. In zero-shot CoT, a trigger phrase (e.g., 'Let's think step by step') is appended instead.
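The two prompt layouts described above can be sketched as plain string assembly. This is a minimal illustration, not a canonical template: the `Exemplar` class, the function names, and the `Q:`/`A:` layout are assumptions introduced here for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One (question, reasoning chain, answer) demonstration (hypothetical type)."""
    question: str
    reasoning: str
    answer: str


def build_few_shot_cot_prompt(exemplars: List[Exemplar], question: str) -> str:
    """Concatenate worked exemplars, then leave the answer slot of the new question open."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def build_zero_shot_cot_prompt(question: str,
                               trigger: str = "Let's think step by step.") -> str:
    """Append the trigger phrase instead of any exemplars."""
    return f"Q: {question}\nA: {trigger}"


# A single exemplar keeps the demo short; the text above suggests 4-8 in practice.
demo = [
    Exemplar(
        question="Anna has 5 apples and gets 3 more. How many does she have?",
        reasoning="Anna has 5 apples, gets 3 more, so 5+3=8.",
        answer="8",
    )
]
new_question = "Ben has 7 pens and loses 2. How many are left?"
few_shot_prompt = build_few_shot_cot_prompt(demo, new_question)
zero_shot_prompt = build_zero_shot_cot_prompt(new_question)
```

Ending the few-shot prompt on an open `A:` is what invites the model to continue with a reasoning chain in the same style as the exemplars.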
The reasoning chain is the core output artifact of CoT. It consists of natural-language sentences that articulate sub-problems, intermediate computations, or logical deductions. It appears between the question and the final answer in the model output.
After the model generates its reasoning chain, the final answer is extracted from the output: by taking the last sentence of a single (for example, greedily decoded) chain, by matching a pattern (e.g., 'The answer is'), or by majority vote over the answers of multiple sampled chains (self-consistency).
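A rough sketch of marker-based extraction with a last-sentence fallback follows. The function name `extract_answer` and the exact regular expressions are assumptions chosen for illustration; real outputs vary enough that the pattern usually needs tuning per task.

```python
import re
from typing import Optional

# Case-insensitive marker; the capture group runs to the end of the line.
MARKER = re.compile(r"the answer is[:\s]*(.+)", re.IGNORECASE)


def extract_answer(output: str) -> Optional[str]:
    """Return the text after the last 'The answer is' marker, else the last sentence."""
    matches = MARKER.findall(output)
    if matches:
        candidate = matches[-1].strip().rstrip(".").strip()
        if candidate:
            return candidate
    # Fallback: split on sentence-ending punctuation followed by whitespace,
    # so decimals such as "3.5" are not split.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", output.strip()) if s.strip()]
    if not sentences:
        return None
    return sentences[-1].rstrip(".")


marked = extract_answer("Anna has 5 apples, gets 3 more, so 5+3=8. The answer is 8.")
unmarked = extract_answer("First add the two numbers. So the total is 8")
```

Using the last marker occurrence rather than the first is a deliberate choice here: a chain may restate the phrase while working through sub-problems, and the final occurrence is the more likely place for the actual answer.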
A model may produce a plausible-looking reasoning chain that does not actually causally determine its final answer — the reasoning post-hoc rationalizes a decision made by other internal mechanisms. The chain may be misleading rather than explanatory.
CoT prompting can hurt the performance of small base models without CoT-specific fine-tuning (below ~100B parameters at the time of the originating paper), because they tend to generate fluent, plausible-sounding but incorrect intermediate steps.
The choice of few-shot exemplars significantly affects CoT performance. Poorly constructed, ambiguous, or domain-mismatched exemplars can degrade reasoning quality.
Generating reasoning chains increases output token count, proportionally increasing latency and API cost relative to direct-answer prompting.
An error in an early intermediate step propagates to all subsequent steps, often yielding a confidently stated but incorrect final answer.
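To get a feel for how quickly errors compound, consider a deliberately simplified model: assume every step is correct independently with the same probability and that any single wrong step makes the final answer wrong. Neither assumption comes from the text above, and real chains can recover from or correlate their mistakes, so this is only a back-of-the-envelope sketch. The function name is hypothetical.

```python
def chain_success_probability(step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps are correct, assuming independent, equally reliable steps."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_accuracy ** num_steps


# With 95% per-step accuracy, a 10-step chain succeeds only about 60% of the time.
ten_step = chain_success_probability(0.95, 10)
```

Under these assumptions, longer chains are not automatically better: each extra step multiplies in another chance to go wrong.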
Wei et al. demonstrate that few-shot prompting with reasoning-chain exemplars significantly improves LLM performance on arithmetic, commonsense, and symbolic reasoning. The paper characterizes CoT as an emergent capability of large-scale models.
Kojima et al. show that appending 'Let's think step by step' to a prompt elicits reasoning chains without any exemplars, making CoT applicable without manual annotation.
Wang et al. propose sampling multiple diverse reasoning paths and selecting the most consistent final answer by majority vote, substantially improving CoT accuracy over greedy decoding.
Yao et al. generalize CoT from linear chains to tree-structured search over intermediate thoughts, enabling backtracking and look-ahead in multi-step problem solving.
OpenAI releases o1, a model trained with large-scale reinforcement learning to produce extended internal reasoning chains before answering, rather than relying on CoT prompting; the details of the reward signal have not been published. This represents a shift from prompting-elicited to trained-in reasoning.
DeepSeek releases R1, an open-source model trained with group relative policy optimization (GRPO) to produce long reasoning chains natively, achieving performance comparable to o1 on reasoning benchmarks.
Time complexity: O(k · T · C). Space complexity: O(k · T + L).
Generating the reasoning chain requires producing many more output tokens than a direct-answer approach. Each token requires one autoregressive model forward pass, making inference latency and compute proportional to chain length.
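The cost expressions above can be turned into a quick estimator. The symbols are not defined in the text, so the reading used here is an assumption: k as the number of sampled chains, T as tokens generated per chain, C as the cost of one forward pass, and L as the prompt length in tokens. The function names are hypothetical, and the model ignores effects such as attention cost growing with context length.

```python
def estimate_decode_cost(num_chains: int, tokens_per_chain: int,
                         cost_per_token: float) -> float:
    """O(k * T * C): one forward pass per generated token, for every chain."""
    return num_chains * tokens_per_chain * cost_per_token


def estimate_token_storage(num_chains: int, tokens_per_chain: int,
                           prompt_tokens: int) -> int:
    """O(k * T + L): generated tokens for all chains plus one shared prompt."""
    return num_chains * tokens_per_chain + prompt_tokens


# Direct answer (1 chain, ~5 tokens) versus a single 200-token reasoning chain.
direct_cost = estimate_decode_cost(1, 5, cost_per_token=1.0)
cot_cost = estimate_decode_cost(1, 200, cost_per_token=1.0)
overhead = cot_cost / direct_cost  # 40.0 with these illustrative numbers
```

The token counts in the example are illustrative, not measurements; the point is only that the overhead scales linearly with chain length and with the number of chains.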
The number of (question, reasoning chain, answer) demonstrations included in the prompt. The original paper used 8 exemplars for most of its arithmetic benchmarks and fewer for some others.
In zero-shot CoT, the phrase appended to the question to elicit reasoning. 'Let's think step by step' was introduced by Kojima et al. (2022).
Number of independently sampled chains for self-consistency decoding. Higher values improve accuracy but multiply compute cost.
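The majority vote itself is small once each sampled chain has been reduced to a final answer. In the sketch below, `sample_chain` is a hypothetical callable standing in for "sample one chain at temperature > 0 and extract its answer"; a canned iterator replaces the model so the example runs on its own.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistency(sample_chain: Callable[[], Optional[str]],
                     num_samples: int = 10) -> Optional[str]:
    """Sample num_samples answers and return the most frequent one."""
    answers = []
    for _ in range(num_samples):
        answer = sample_chain()
        if answer is not None:  # skip chains whose answer could not be extracted
            answers.append(answer)
    if not answers:
        return None
    # most_common breaks ties in favour of the answer seen first.
    return Counter(answers).most_common(1)[0][0]


# Stand-in for ten sampled chains; None marks a failed extraction.
canned = iter(["8", "8", "9", "8", None, "7", "8", "9", "8", "8"])
voted = self_consistency(lambda: next(canned), num_samples=10)
```

Answers are compared as exact strings here, so normalising them first (for example stripping units or whitespace) matters in practice; otherwise "8" and "8 apples" would split the vote.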
CoT performance gains are strongly dependent on model scale. In Wei et al. (2022), benefits were observed primarily in models above ~100B parameters (PaLM 540B, GPT-3 175B). This threshold has shifted with later fine-tuned smaller models.
CoT is a prompting strategy applied at inference time and introduces no sparse or conditional activation of its own. On a standard dense LLM, all model parameters are active during every inference pass; any sparsity comes from the base model architecture, not from CoT.
Multiple independent chains (self-consistency) can be generated in parallel across a batch dimension, provided the compute budget allows.
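When the model sits behind a remote endpoint, a related client-side option is to issue the independent samples as concurrent requests. This is a different technique from batching on the accelerator, which a serving stack would typically handle itself. The sketch assumes a hypothetical `generate(prompt, seed)` callable; `fake_generate` is a stand-in so the example runs without a model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def sample_chains_in_parallel(generate: Callable[[str, int], str],
                              prompt: str,
                              num_chains: int,
                              max_workers: int = 8) -> List[str]:
    """Run num_chains independent generations concurrently, preserving submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(generate, prompt, seed) for seed in range(num_chains)]
        return [future.result() for future in futures]


def fake_generate(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one temperature > 0 model call."""
    return f"chain {seed} for: {prompt}"


chains = sample_chains_in_parallel(fake_generate, "Q: 5+3? A:", num_chains=4)
```

Threads suit this case because each call mostly waits on I/O; `max_workers` caps how much of the compute or rate-limit budget is used at once.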
CoT is an inference-time technique applied to LLMs, which operate most efficiently on GPUs with tensor cores for matrix multiplications in the attention and feed-forward layers of the transformer.
TPUs are commonly used for large-scale LLM inference; CoT is compatible with any hardware capable of running the base model.