Robots Atlas

Tree of Thoughts

Extends Chain-of-Thought to tree-structured exploration of candidate reasoning paths, with evaluation and backtracking, enabling strategic planning and solution-space search.

Category

Abstraction level: Operation level

Math puzzles requiring planning · Combinatorial problems · Code generation with testing · Agent planning · Logic games

ToT decomposes reasoning into three modular components: (1) a Thought Generator - the LLM produces candidates for the next reasoning step; (2) a State Evaluator - the LLM or a separate function scores the promise of each state; (3) a Search Algorithm - BFS, DFS, or beam search explores the state tree. The model can backtrack to a previous state and try a different branch.
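A minimal sketch of how the three components compose, assuming a generic llm(prompt) completion helper; the prompts and numeric rating are simplified placeholders, not the paper's reference implementation:

```python
# Minimal ToT-BFS sketch (illustrative, not the paper's reference code).
# `llm(prompt)` is an assumed text-completion helper.
def tot_bfs(problem, llm, k=5, b=5, T=3):
    frontier = [""]  # partial reasoning paths; root = empty path
    for _ in range(T):
        # (1) Thought Generator: k candidate next steps per active state
        candidates = [
            state + llm(f"{problem}\n{state}\nNext step:") + "\n"
            for state in frontier
            for _ in range(k)
        ]
        # (2) State Evaluator: score each candidate state
        scored = [
            (float(llm(f"{problem}\n{c}\nRate this progress 0-10:")), c)
            for c in candidates
        ]
        # (3) Search Algorithm: keep the b most promising states (beam)
        scored.sort(key=lambda sc: sc[0], reverse=True)
        frontier = [c for _, c in scored[:b]]
    return frontier[0]
```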

Language models with CoT are confined to left-to-right, token-by-token decisions during inference, which fails on tasks that require exploration or strategic planning, or where early decisions are pivotal.

01

Thought Generator

Produces candidates for the next reasoning step from the current tree state.

Modular

LLM-based component generating k candidate thoughts. Two strategies: 'sample' (independent sampling from the same prompt - for rich thought spaces, e.g. Creative Writing) or 'propose' (sequential proposal of multiple thoughts in one prompt - for constrained spaces, e.g. Game of 24).

sample · propose
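The two strategies as functions, a sketch assuming a generic llm helper that accepts an optional sampling temperature (hypothetical, not the reference implementation):

```python
# Two thought-generation strategies (sketch).
def generate_sample(llm, context, k):
    # 'sample': k independent completions of the same prompt.
    # Suits rich thought spaces (e.g. Creative Writing plans),
    # where independent samples rarely duplicate each other.
    return [llm(context, temperature=0.7) for _ in range(k)]

def generate_propose(llm, context, k):
    # 'propose': one call asks the model to list k distinct next steps.
    # Suits constrained spaces (e.g. Game of 24 moves), where independent
    # sampling would keep repeating the same few thoughts.
    reply = llm(f"{context}\nPropose {k} different next steps, one per line:")
    return reply.strip().split("\n")[:k]
```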
02

State Evaluator

Evaluates the promise of each intermediate state (thought) as a heuristic for the search algorithm.

Modular

The LLM evaluates each state in one of two ways: 'value' (assigns a numeric or categorical value to a single state, used in Game of 24) or 'vote' (compares multiple states and votes for the best, used in Creative Writing). The heuristic is soft and depends on LLM quality.

value · vote
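A sketch of both evaluators, again assuming a generic llm helper. The sure/maybe/impossible scale mirrors the Game of 24 value prompt; the numeric mapping is an illustrative choice:

```python
# Two state-evaluation strategies (sketch).
def evaluate_value(llm, state, n_evaluate_sample=3):
    # 'value': score one state in isolation, averaging n samples.
    scores = {"sure": 1.0, "maybe": 0.5, "impossible": 0.0}
    replies = [
        llm(f"{state}\nCan this still reach a solution? sure/maybe/impossible:")
        for _ in range(n_evaluate_sample)
    ]
    return sum(scores.get(r.strip().lower(), 0.0) for r in replies) / n_evaluate_sample

def evaluate_vote(llm, states, n_evaluate_sample=3):
    # 'vote': show all sibling states at once; tally which one the LLM prefers.
    listing = "\n".join(f"{i}: {s}" for i, s in enumerate(states))
    votes = [0] * len(states)
    for _ in range(n_evaluate_sample):
        choice = int(llm(f"{listing}\nReply with the index of the best choice:"))
        votes[choice] += 1
    return votes
```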
03

Search Algorithm

Explores the tree of thought states using classical graph algorithms.

Modular

ToT uses classical algorithms: BFS (Game of 24, Creative Writing - maintains the b most promising states per level) or DFS (Mini Crosswords - explores deeply, backtracking when a state is judged unpromising). The algorithm itself is LLM-independent.

BFS · DFS
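The BFS variant is sketched earlier; here is a DFS sketch with pruning and backtracking. generate, evaluate, is_solution, and the pruning threshold are assumed placeholders:

```python
# DFS with pruning and backtracking (sketch).
def tot_dfs(state, depth, generate, evaluate, is_solution, T, threshold=0.5):
    if is_solution(state):
        return state
    if depth >= T:
        return None  # depth limit hit; give up on this branch
    for thought in generate(state):
        child = state + thought
        if evaluate(child) < threshold:
            continue  # prune: state judged unpromising
        found = tot_dfs(child, depth + 1, generate, evaluate,
                        is_solution, T, threshold)
        if found is not None:
            return found
    return None  # all children failed -> backtrack to the parent state
```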
Time complexity: O(k^T · c_LLM)

k - branching factor (n_generate_sample), T - tree depth, c_LLM - cost of one LLM call. Worst case: a full tree of depth T with branching factor k, explored without pruning.

In practice BFS with beam width b bounds cost to O(b · k · T · c_LLM), which is 10-100x more expensive than CoT (O(T · c_LLM)).
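A back-of-envelope check of that ratio under illustrative parameter values:

```python
# ToT vs CoT cost under the formulas above; values are illustrative.
b, k, T = 5, 8, 3              # beam width, candidate thoughts per state, depth
tot_calls = b * k * T          # O(b*k*T) LLM calls, constants dropped
cot_calls = T                  # O(T): a single linear chain of T steps
print(tot_calls // cot_calls)  # 40 -> inside the quoted 10-100x range
```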

Memory complexity: O(b · T)

BFS keeps b (beam width) active states per level, and each state carries up to T (depth) accumulated thoughts. Each state is a bounded-length text (context + thoughts so far).

Memory is dominated by accumulated LLM contexts, not the tree structure itself.

Bottleneck: LLM API calls

Each ToT step requires many LLM calls: generating k candidates plus n_evaluate_sample evaluation calls for each of the b states. API latency and cost grow multiplicatively with the search parameters.

Parallelism

Partially parallel

Within a BFS level, generation of k candidates and their evaluation are fully parallel. Between levels the search remains sequential.
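A sketch of the within-level fan-out with a thread pool, assuming llm_step is a blocking, thread-safe call (e.g. a remote API request):

```python
# Within-level parallelism (sketch): fan out all generation calls for one
# BFS level, then synchronize before the next level begins.
from concurrent.futures import ThreadPoolExecutor

def expand_level(frontier, llm_step, k, max_workers=16):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        jobs = [pool.submit(llm_step, state)
                for state in frontier for _ in range(k)]
        # This barrier is why levels remain sequential: the next level
        # cannot start until every candidate of this level has returned.
        return [job.result() for job in jobs]
```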

Paradigm

Conditional

Input dependent

The computation trajectory is conditional - it depends on state evaluations. Different inputs produce different tree shapes and different search paths.

Number of thought candidates (k)

Standard
  • 1 · Game of 24 (propose)
  • 5-10 · Creative Writing (sample)

Number of thoughts generated at each step by the generator. Larger k = broader exploration but linearly higher cost.

Beam width (b)

Critical
  • 5 · Default for Game of 24 in the paper
  • 1 · Greedy (degenerates to CoT)

Number of states kept per BFS level. Classical beam search parameter - tradeoff between exploration and cost.

Number of evaluation samples

Standard
  • 3 · Paper default

Number of LLM calls to evaluate each state (averaged to reduce variance).

Tree depth (T)

Standard
  • 3 · Game of 24

Maximum number of reasoning steps in the tree. Task-dependent (Game of 24: 3, Crosswords: much deeper).

Generator strategy

Standard
  • sample · Rich thought space
  • propose · Constrained thought space

Choice of thought generation strategy: sample (independent sampling) or propose (sequential proposals in one prompt).

Evaluator strategy

Standard
  • value · Game of 24
  • vote · Creative Writing

Choice of state evaluation strategy: value (independent scoring) or vote (voting over a set).
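The knobs above, gathered into per-task configs. n_generate_sample and n_evaluate_sample follow the implementation names quoted in the text; the remaining keys and the Creative Writing values are illustrative, not an exact reproduction of the paper:

```python
# Illustrative ToT configurations for the two tasks discussed above.
game24_config = dict(
    method_generate="propose",  # constrained thought space
    method_evaluate="value",    # independent per-state scoring
    n_generate_sample=1,        # one propose call lists the candidates
    n_evaluate_sample=3,        # paper default, averaged to cut variance
    beam_width=5,               # b: states kept per BFS level
    depth=3,                    # T: three arithmetic steps reach 24
)
creative_writing_config = dict(
    method_generate="sample",   # rich thought space
    method_evaluate="vote",     # compare sibling states, vote for the best
    n_generate_sample=5,
    n_evaluate_sample=5,        # illustrative vote count
    beam_width=1,               # keep only the winning plan
    depth=2,                    # plan, then passage
)
```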

Common pitfalls

High computational cost
HIGH

ToT requires multiple LLM calls per step (generation plus evaluation, multiplied across branches), making it 10-100x more expensive than CoT.

Thought evaluator can be unreliable
MEDIUM

Using the same LLM as both generator and evaluator creates circular reasoning risks; evaluation quality depends on model capability.

GENESIS · Source paper

Tree of Thoughts: Deliberate Problem Solving with Large Language Models
2023 · NeurIPS 2023 · Shunyu Yao, Dian Yu, Jeffrey Zhao et al.
2022

Chain-of-Thought prompting as foundation

Wei et al. establish CoT as the baseline linear reasoning approach that ToT extends.

2022

Self-Consistency - first multi-path exploration

Wang et al. propose sampling multiple CoT paths and voting, a precursor to ToT.

2023

Tree of Thoughts (Yao et al., NeurIPS 2023)

breakthrough

Yao et al. formalize tree-structured deliberate reasoning with evaluation and backtracking.

2024

Graph of Thoughts and successors

Besta et al. extend ToT to arbitrary graph structures for more flexible reasoning topologies.

Hardware agnostic · PRIMARY

ToT is an inference framework - it runs on any backend capable of LLM execution (API, local GPU, TPU). The search algorithm itself is CPU-bound and lightweight.

Performance depends on the LLM backend, not on ToT.

GPU Tensor Cores · GOOD

Local ToT execution benefits from GPUs for parallel batching of k candidates and b states in one LLM inference call.

BUILT ON

CoT

Chain-of-Thought (CoT) Reasoning is a prompting technique introduced by Wei et al. (2022) in which a large language model is induced to generate a series of intermediate natural-language reasoning steps as part of its output, prior to producing a final answer. The technique was shown to significantly improve LLM performance on arithmetic, commonsense, and symbolic reasoning benchmarks where standard few-shot prompting yields flat or poor results. In the original formulation (few-shot CoT), a small number of exemplar question-answer pairs are included in the prompt, where each answer consists of a chain of thought followed by the final answer. The model learns from these demonstrations to produce its own reasoning chains. A subsequent zero-shot variant (Kojima et al., 2022) showed that appending the phrase 'Let's think step by step' to a question is sufficient to elicit reasoning chains from large models without any exemplars. CoT is an emergent property: empirical results in the originating paper show that reasoning ability via CoT prompting appears only in models above a certain parameter threshold (approximately 100B parameters for the models tested in 2022), with smaller models not benefiting or performing worse. This relationship has been revisited by subsequent work as smaller models have been fine-tuned on CoT data. Key extensions include Self-Consistency CoT (Wang et al., 2022), which samples multiple reasoning paths and selects the most frequent final answer; Tree of Thoughts (Yao et al., 2023), which frames reasoning as search over a tree of intermediate thoughts; and native reasoning models such as OpenAI o1 (2024) and DeepSeek-R1 (2025), which internalize extended reasoning through reinforcement learning on process reward signals rather than relying on prompting.
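For illustration, the zero-shot variant described above amounts to a one-line prompt change; the question is a stock example, not one from the paper:

```python
# Zero-shot CoT (Kojima et al., 2022): append the trigger phrase.
question = "If a train travels 60 km in 45 minutes, what is its speed in km/h?"
prompt = question + "\nLet's think step by step."
# The model first emits intermediate reasoning, then the final answer (80 km/h).
```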


EXTENDS

Self-Consistency

Self-Consistency (Wang et al., ICLR 2023) is a decoding strategy for language models using Chain-of-Thought. Instead of using greedy decoding (selecting the most probable token at each step), Self-Consistency samples multiple diverse reasoning paths and then selects the answer that appears most frequently across all paths (marginalization over reasoning paths). The method is based on the intuition that a complex reasoning problem typically admits multiple different ways of reaching the same correct answer. Experiments demonstrated impressive CoT performance gains: +17.9% on GSM8K, +11.0% on SVAMP, +12.2% on AQuA (Google, ICLR 2023).
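A minimal sketch of the decoding strategy, assuming an llm sampling helper and a task-specific answer extractor:

```python
# Self-Consistency (sketch): sample n reasoning paths, majority-vote the answer.
from collections import Counter

def self_consistency(llm, prompt, n=40, extract_answer=lambda s: s.split()[-1]):
    answers = [extract_answer(llm(prompt, temperature=0.7)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]  # marginalize over paths
```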


ALTERNATIVE TO

Self-Consistency

Described under EXTENDS above.

Reflexion

Reflexion (Shinn et al., NeurIPS 2023) is a framework for reinforcing LLM agents not through weight updates, but through linguistic feedback. Reflexion agents verbally reflect on task feedback signals, then maintain these verbal reflections in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible regarding types of feedback signals (scalar values or free-form language) and sources (external or internally simulated). The method achieved 91% pass@1 on the HumanEval benchmark (coding), surpassing the then-SOTA GPT-4 at 80%. The work aligns with the paradigm of agents learning from experience without expensive RL.
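A minimal sketch of the trial-reflect loop, with llm, run_task, and get_feedback as assumed helpers:

```python
# Reflexion trial loop (sketch): no weight updates, only verbal reflections
# accumulated in an episodic memory buffer.
def reflexion(llm, run_task, get_feedback, task, max_trials=3):
    memory = []  # episodic buffer of verbal reflections
    for _ in range(max_trials):
        attempt = run_task(task, reflections=memory)
        feedback = get_feedback(attempt)  # scalar or free-form signal
        if feedback == "pass":
            return attempt
        memory.append(llm(
            f"Task: {task}\nAttempt: {attempt}\nFeedback: {feedback}\n"
            "Reflect: what went wrong and what should change next trial?"
        ))
    return attempt
```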

ReAct

ReAct (Reasoning + Acting) is a prompting pattern introduced by Yao et al. (ICLR 2023) in which a large language model generates an interleaved sequence of three token types: Thought (a natural-language reasoning step), Action (a tool invocation — e.g. search, calculator, API), and Observation (the tool's returned result). The Thought→Action→Observation loop repeats until the model emits a Finish action that produces the final answer. ReAct resolves a fundamental limitation of pure Chain-of-Thought: CoT generates reasoning solely from model parameters and is prone to factual hallucination. Pure tool-using LLMs (e.g. Toolformer) execute actions but lack explicit planning. ReAct integrates both — reasoning guides action selection, and observations update the reasoning state in subsequent steps. In the original paper, Yao et al. evaluated ReAct on four task classes: HotpotQA and FEVER (multi-hop QA / fact-checking with Wikipedia access), ALFWorld (interactive tasks in a simulated text environment), and WebShop (online shopping). On HotpotQA, ReAct reaches 27.4% EM vs 22.4% for CoT — the key advantage being hallucination reduction through Wikipedia fact verification. On ALFWorld, ReAct achieves 71% success vs 25% for the best imitation-learning baseline. ReAct became the dominant pattern for LLM agents in 2023–2024 and is built into most agent frameworks: LangChain (AgentExecutor), LlamaIndex (ReActAgent), AutoGPT, BabyAGI, OpenAI Assistants API. Later extensions include Reflexion (Shinn et al. 2023 — self-criticism across episodes), Tree of Thoughts (Yao et al. 2023 — tree-structured search instead of linear trajectory), and native function calling in model APIs (OpenAI, Anthropic, Google), which internalizes the ReAct loop at the protocol level.
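A minimal sketch of the loop, with llm and a tools dict (e.g. {"search": fn}) as assumed helpers; the Finish[...] and Action parsing is deliberately simplified:

```python
# ReAct loop (sketch): interleave Thought / Action / Observation until the
# model emits Finish[answer].
def react(llm, tools, question, max_steps=8):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")  # model writes Thought + Action
        transcript += "Thought:" + step + "\n"
        if "Finish[" in step:
            return step.split("Finish[", 1)[1].split("]", 1)[0]
        name, arg = step.split("Action:", 1)[1].strip().split("[", 1)
        observation = tools[name.strip()](arg.rstrip("]").rstrip())
        transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted
```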


Commonly used with

CoT

Described under BUILT ON above.
