Reflexion
Replaces neural weight updates with verbal reinforcement learning: the agent verbally reflects on error signals and stores its conclusions in episodic memory, enabling rapid adaptation without expensive fine-tuning.
In each trial, the agent performs a task and receives a feedback signal. A reflection module (the same LLM) analyzes the feedback and generates a verbal reflection describing what went wrong and how to avoid it. The reflection is appended to an episodic memory buffer, and in the next trial the buffer is added to the agent's context, letting it reason over previous experience.
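A minimal sketch of this trial loop, in Python. `llm`, `run_task`, and `evaluate` are hypothetical stand-ins for the actor/reflector LLM, the task environment, and the feedback signal; they are not part of the paper's code.

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    success: bool  # did the trial succeed?
    detail: str    # textual description of the error signal

def reflexion_loop(task, llm, run_task, evaluate, max_trials=5):
    memory = []  # episodic memory buffer of verbal reflections
    attempt = None
    for _ in range(max_trials):
        # Reflections from earlier trials are appended to the context.
        context = task
        if memory:
            context += "\n\nLessons from previous trials:\n" + "\n".join(memory)
        attempt = run_task(llm, context)   # actor performs the task
        feedback = evaluate(attempt)       # external feedback signal
        if feedback.success:
            break
        # The same LLM acts as the reflection module: it analyzes the
        # feedback and verbalizes what went wrong and how to avoid it.
        reflection = llm(
            f"Task: {task}\nAttempt: {attempt}\nFeedback: {feedback.detail}\n"
            "Explain the error and how to avoid it in the next trial."
        )
        memory.append(reflection)          # store in episodic memory
    return attempt
```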
Traditional reinforcement learning methods require a large number of trials and expensive model fine-tuning; LLM agents should be able to learn quickly from trial-and-error without parameter modification.
Common pitfalls
Context length limits episodic memory · HIGH
As reflections accumulate, the context window fills up, limiting the number of trials from which the agent can learn.
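A common mitigation, sketched below, is to keep only the most recent reflections so the buffer stays within the context budget, trading learning horizon for context space. The cap value is an illustrative assumption, not from the paper.

```python
MAX_REFLECTIONS = 3  # illustrative cap; tune to the model's context window

def add_reflection(memory, reflection):
    # Append the new reflection, then drop the oldest entries so the
    # buffer fits in context (a summarization step could retain them).
    memory.append(reflection)
    return memory[-MAX_REFLECTIONS:]
```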
Reflections can reinforce wrong beliefs · MEDIUM
If the agent generates incorrect reflections (e.g., misattributing the cause of a failure), subsequent trials may be steered in the wrong direction.
Reference implementations
GENESIS · Source paper
Reflexion: Language Agents with Verbal Reinforcement Learning

ReAct - combining reasoning and acting
Yao et al. propose ReAct, interleaving reasoning traces with actions; a precursor to Reflexion.
Reflexion (Shinn et al., NeurIPS 2023) · breakthrough
Shinn et al. introduce verbal reflection with episodic memory as a substitute for RL fine-tuning.
Reflexion integrated into agent frameworks
Reflexion-style reflection is adopted in LangChain, AutoGen, and other agent frameworks.
BUILT ON
ReAct
ReAct (Reasoning + Acting) is a prompting pattern introduced by Yao et al. (ICLR 2023) in which a large language model generates an interleaved sequence of three token types: Thought (a natural-language reasoning step), Action (a tool invocation, e.g. search, calculator, API), and Observation (the tool's returned result). The Thought→Action→Observation loop repeats until the model emits a Finish action that produces the final answer.

ReAct resolves a fundamental limitation of pure Chain-of-Thought: CoT generates reasoning solely from model parameters and is prone to factual hallucination. Pure tool-using LLMs (e.g. Toolformer) execute actions but lack explicit planning. ReAct integrates both: reasoning guides action selection, and observations update the reasoning state in subsequent steps.

In the original paper, Yao et al. evaluated ReAct on four task classes: HotpotQA and FEVER (multi-hop QA and fact-checking with Wikipedia access), ALFWorld (interactive tasks in a simulated text environment), and WebShop (online shopping). On HotpotQA, ReAct reaches 27.4% EM, roughly on par with CoT; its key advantage is hallucination reduction through Wikipedia fact verification. On ALFWorld, ReAct achieves 71% success vs 37% for the best imitation-learning baseline.

ReAct became the dominant pattern for LLM agents in 2023–2024 and is built into most agent frameworks: LangChain (AgentExecutor), LlamaIndex (ReActAgent), AutoGPT, BabyAGI, OpenAI Assistants API. Later extensions include Reflexion (Shinn et al. 2023, self-criticism across episodes), Tree of Thoughts (Yao et al. 2023, tree-structured search instead of a linear trajectory), and native function calling in model APIs (OpenAI, Anthropic, Google), which internalizes the ReAct loop at the protocol level.
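The loop itself is small. Below is a minimal sketch, assuming a hypothetical `llm` completion function and a `tools` dict mapping tool names to callables; the regex-based Action parsing is a simplification of what real frameworks do.

```python
import re

def react(question, llm, tools, max_steps=8):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model continues the transcript with a Thought followed by
        # an Action such as "Action: search[query]" or "Action: Finish[answer]".
        step = llm(transcript + "Thought:")
        transcript += "Thought:" + step + "\n"
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if not match:
            continue
        action, arg = match.group(1), match.group(2)
        if action == "Finish":
            return arg  # final answer terminates the loop
        if action not in tools:
            transcript += f"Observation: unknown tool '{action}'\n"
            continue
        # Execute the tool and feed its result back as an Observation,
        # which conditions the next Thought.
        observation = tools[action](arg)
        transcript += f"Observation: {observation}\n"
    return "No answer within step budget"
```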
| Title | Publisher | Type |
|---|---|---|
| Reflexion: Language Agents with Verbal Reinforcement Learning | — | scientific article |