TextGrad
How it works
1) Forward pass — the AI system executes its computation: prompt → LLM call → response → subsequent steps. Each intermediate variable is wrapped in a Variable object.
2) Loss — a loss function expressed in natural language (e.g. ‘judge whether the answer is correct and complete’) returns a critique.
3) Backward pass — a teacher model (typically GPT-4 or stronger) generates a textual gradient: for each variable it produces a critique pointing out what should be improved, propagating information backward through the computation graph.
4) Optimizer step — TGD (Textual Gradient Descent) uses the textual gradient to produce a new version of the variable (e.g. an improved prompt or code). Variants include vanilla TGD, TGD-Momentum, Constrained TGD, and Batch TGD, analogous to SGD/Adam in PyTorch.
5) Iteration — the forward–backward–step cycle is repeated 5–50 times until a satisfactory loss value is reached.
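A minimal sketch of this loop using the `textgrad` PyPI package (engine name, question, and loss instruction are illustrative; the calls follow the project's README, but exact signatures may differ between versions):

```python
import textgrad as tg

# Engine that generates the textual gradients in the backward pass
tg.set_backward_engine("gpt-4o", override=True)

# 1) Forward pass: run the system; the output is wrapped in a Variable
model = tg.BlackboxLLM("gpt-4o")
question = tg.Variable(
    "A 10 m ladder leans against a wall with its base 6 m from it. How high up the wall does it reach?",
    role_description="question to the LLM",
    requires_grad=False,
)
answer = model(question)
answer.set_role_description("concise and accurate answer to the question")

# 2) Loss: a natural-language evaluation instruction
loss_fn = tg.TextLoss("Judge whether the answer is correct and complete; point out concrete errors.")

# 4) Optimizer: TGD rewrites trainable Variables using their textual gradients
optimizer = tg.TGD(parameters=[answer])

# 5) Iterate forward -> backward -> step (typically 5-50 times; 3 here to limit cost)
for _ in range(3):
    optimizer.zero_grad()   # clear critiques from the previous iteration
    loss = loss_fn(answer)  # textual critique of the current answer
    loss.backward()         # 3) backward pass: textual gradients for trainable Variables
    optimizer.step()        # rewrite `answer` according to its critique

print(answer.value)
```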
Problem solved
Manual tuning of prompts, code and configurations in complex LLM systems is laborious, poorly reproducible, and does not scale to multi-step agentic pipelines. Classical optimization methods (gradient descent on weights) do not apply at the application layer, where models are used as black boxes via APIs. TextGrad addresses this by providing an autograd-like abstraction: instead of numeric gradients it back-propagates natural-language critiques, enabling automatic improvement of any textual artefact (prompt, code snippet, task solution) against a defined loss function.
Components
Variable — wrapper for a textual variable (prompt, code, answer). Stores a value and a gradient (critique). Analogous to `torch.Tensor` with `requires_grad=True`. Can be marked as trainable or frozen.
Engine — the language-model engine used for forward passes and for generating textual gradients. Typically a GPT-4-class model (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro). Gradient quality depends strongly on engine capability.
Official
Loss — a loss function expressed in natural language (e.g. ‘judge the mathematical correctness of the answer and point out errors’). It can be implemented as a prompt to a teacher model, a classifier, or a scalar metric with a description.
Official
Textual Gradient Descent — an optimizer that updates variables based on textual gradients. Variants: vanilla TGD, TGD-Momentum (accumulating historical critiques), Constrained TGD (with form constraints), Batch TGD (aggregating gradients across multiple samples).
Official
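A sketch of how these components fit together for prompt optimization, with the task input frozen and the system prompt trainable (this assumes `BlackboxLLM` accepts a `system_prompt` Variable, as in the project's tutorials; treat names and signatures as illustrative):

```python
import textgrad as tg

engine = tg.get_engine("gpt-4o")               # Engine: forward passes + textual gradients
tg.set_backward_engine(engine, override=True)

# Trainable Variable: the system prompt TextGrad is allowed to rewrite
system_prompt = tg.Variable(
    "You are a careful assistant. Answer step by step.",
    requires_grad=True,
    role_description="system prompt for a reasoning assistant",
)

# Frozen Variable: task input that must not be modified
question = tg.Variable(
    "Is 2^127 - 1 prime?",
    requires_grad=False,
    role_description="question to the LLM",
)

model = tg.BlackboxLLM(engine, system_prompt=system_prompt)
loss_fn = tg.TextLoss("Judge the mathematical correctness of the answer and point out errors.")
optimizer = tg.TGD(parameters=[system_prompt])  # TGD updates only trainable Variables

loss = loss_fn(model(question))
loss.backward()
optimizer.step()
print(system_prompt.value)                      # the rewritten system prompt
```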
Implementation
TextGrad requires a strong GPT-4-class teacher model. A typical 5–50 iteration session costs 10–100 USD in API calls. Weaker engines produce low-quality gradients and the session may fail to converge.
A natural-language loss function must be specific and unambiguous. Vague instructions (e.g. ‘make it better’) produce inconsistent critiques, and the optimizer oscillates between solutions instead of converging (see the contrast sketched after this list).
Iterative optimization can over-specialize the prompt to the specific examples in the TextGrad training set, degrading generalization to new tasks.
LLM-generated critiques are stochastic — two sessions on the same input can follow different optimization paths and yield different final results, complicating debugging and reproducibility.
Sequential forward–backward–step iterations on a strong LLM take from minutes to hours. TextGrad is not suitable for real-time optimization or large search spaces.
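To illustrate the loss-specificity point above, a hypothetical pair of loss formulations (the instruction strings are illustrative, not fixed templates):

```python
import textgrad as tg

# Vague objective: critiques drift between iterations and the optimizer oscillates
vague_loss = tg.TextLoss("Make the answer better.")

# Specific objective: consistent, actionable critiques
specific_loss = tg.TextLoss(
    "Judge the mathematical correctness of the answer. "
    "List every incorrect step, explain why it is wrong, "
    "and state whether the final result is correct."
)
```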
Evolution
The Stanford team (Yuksekgonul, Zou et al.) publishes the first version of the TextGrad framework as a PyTorch analogue for text, with results on GPQA, LeetCode Hard, radiotherapy planning, and SMILES-based molecule optimization.
Open-source release on GitHub under the MIT license; the `textgrad` package is published on PyPI. Support for OpenAI, Anthropic, Together AI, and local models.
TextGrad results are published in Nature, formalizing the textual-differentiation paradigm and confirming gains over Chain-of-Thought on GPQA (+5 p.p.) and LeetCode Hard (+20 p.p.).
Technical details
Hyperparameters (configurable axes)
Iterations — number of full forward–backward–step cycles. Typically 5–50. More iterations increase the chance of improvement but raise cost linearly.
Engine — choice of the teacher model. The Nature results were obtained with GPT-4. Weaker engines (GPT-3.5, open-source models <70B) produce lower-quality gradients.
Optimizer variant — vanilla TGD, TGD-Momentum, Constrained TGD, or Batch TGD. Momentum reduces oscillations under noisy critiques; Constrained TGD enforces an output format.
Loss formulation — the natural-language statement of the objective. Its specificity and correctness determine optimization quality, since it directly shapes the critiques.
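The same axes expressed as code, as a hedged sketch (the `constraints` keyword for Constrained TGD is assumed from the library's optimizer options; verify against the installed version):

```python
import textgrad as tg

MAX_STEPS = 10   # iteration budget: typically 5-50; cost grows roughly linearly

# Engine: the Nature results used GPT-4-class teacher models
tg.set_backward_engine("gpt-4o", override=True)

# Variable under optimization
code = tg.Variable(
    "def is_prime(n):\n    return all(n % k for k in range(2, n))",
    requires_grad=True,
    role_description="Python function to optimize",
)

# Loss formulation: the wording directly shapes the critiques
loss_fn = tg.TextLoss("Check correctness and asymptotic complexity; list concrete bugs.")

# Optimizer variant: constraints=[...] corresponds to Constrained TGD (assumed keyword)
optimizer = tg.TGD(
    parameters=[code],
    constraints=["Keep the original function signature."],
)
```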
Execution paradigm
Sparse with respect to compute cost: in a typical iteration only selected fragments of the computation graph (the paths leading to the losses) are active.
TextGrad does not introduce routing in the classical MoE sense. Each variable has a single gradient produced by the teacher model in a given iteration.
Parallelism
Forward passes and loss computations across samples in a batch are parallelizable. The backward pass for a single sample is largely sequential due to graph dependencies. Batch TGD allows parallelizing gradients across samples.
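A sketch of batch-level parallelism: forward passes and loss computations for independent samples run concurrently, while each sample's backward pass stays sequential (assumes the underlying API client is thread-safe; names and questions are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import textgrad as tg

tg.set_backward_engine("gpt-4o", override=True)
model = tg.BlackboxLLM("gpt-4o")
loss_fn = tg.TextLoss("Judge whether the answer is correct and complete.")

questions = [
    tg.Variable(q, role_description="question to the LLM", requires_grad=False)
    for q in ("What is 17 * 23?", "Name the capital of Australia.", "Is 91 prime?")
]

def forward_and_loss(question):
    # Independent per sample: can run in parallel
    answer = model(question)
    answer.set_role_description("answer to the question")
    return loss_fn(answer)

with ThreadPoolExecutor(max_workers=3) as pool:
    losses = list(pool.map(forward_and_loss, questions))

# Backward passes remain sequential per sample due to graph dependencies
for loss in losses:
    loss.backward()
```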
Hardware requirements
TextGrad operates at the application layer (LLM APIs) and does not require specific hardware on the user side. Hardware requirements lie on the side of the teacher-model provider's infrastructure.