Reasoning

Actor-Critic

1983 · Active · Published: 8 June 2026 · Updated: 8 June 2026 · Published

Key innovation

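Combines policy learning (actor) with value estimation (critic) in one architecture. The critic's value estimate reduces policy-gradient variance relative to pure REINFORCE, while the explicitly parameterized actor avoids deriving the policy indirectly from value estimates, as purely value-based methods do.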

How it works

At each step:
(1) The actor π_θ(a|s) selects an action based on the current state s.
(2) The environment returns reward r and next state s'.
(3) The critic V_w(s) (or Q_w) computes the temporal-difference (TD) error δ = r + γV_w(s') − V_w(s), which serves as an estimate of the advantage.
(4) The critic is updated by regression to minimize the squared TD error.
(5) The actor is updated along the policy gradient weighted by δ, ∇_θ log π_θ(a|s)·δ, which increases the probability of actions that turned out better than expected.

Variants differ in advantage estimation (GAE), bootstrap length (n-step returns), entropy regularization (SAC), or clipping of the policy update (PPO).
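The five steps can be sketched as a single tabular update. This is a minimal illustration, not a reference implementation: the softmax-over-preferences actor, the table-based critic, the function name `actor_critic_step`, the learning rates, and the one-state toy task are all assumptions made for the example.

```python
import math
import random

def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.1, lr_critic=0.1):
    """One in-place update of actor preferences theta[s][a] and critic V[s]."""
    # Step 3: TD error, with no bootstrapping past a terminal state.
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]
    # Step 4: critic moves toward the TD target (semi-gradient on 0.5 * delta^2).
    V[s] += lr_critic * delta
    # Step 5: for a softmax policy, d log pi(a|s) / d theta[s][b] = 1[b == a] - pi(b|s).
    probs = softmax(theta[s])
    for b in range(len(probs)):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += lr_actor * delta * grad_log
    return delta

# Toy task (hypothetical): one state, action 1 pays reward 1, action 0 pays 0.
random.seed(0)
theta = [[0.0, 0.0]]
V = [0.0]
for _ in range(2000):
    probs = softmax(theta[0])
    a = random.choices([0, 1], weights=probs)[0]   # Step 1: actor samples an action
    r = 1.0 if a == 1 else 0.0                     # Step 2: environment returns reward
    actor_critic_step(theta, V, s=0, a=a, r=r, s_next=0, done=True)
p_good = softmax(theta[0])[1]  # probability the trained actor assigns to action 1
```

In this toy task both kinds of outcome push the actor the same way: a better-than-expected reward (δ > 0) raises the preference for the chosen action, and a worse-than-expected one (δ < 0) lowers it.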

Problem solved

Pure policy-gradient methods (REINFORCE) suffer from high estimator variance, which slows and destabilizes learning. Pure value-based methods (Q-learning) are hard to apply in continuous action spaces, because selecting an action requires maximizing over all actions. Actor-Critic combines the strengths of both: lower variance via the critic and direct policy parameterization for continuous actions.
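The variance argument can be made concrete by writing the two single-sample gradient estimators side by side. The time-indexed notation (r_{t+1}, G_t) is an assumption borrowed from the usual textbook convention; the page itself uses the unindexed r and s'.

```latex
% REINFORCE: weight the score function by the full Monte Carlo return
\hat{g}_{\mathrm{REINFORCE}} = \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t,
\qquad G_t = \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k+1}

% One-step actor-critic: weight the score function by the TD error
\hat{g}_{\mathrm{AC}} = \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \delta_t,
\qquad \delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)
```

G_t sums every future random reward in the episode, whereas δ_t depends on a single reward and the critic's estimates, which is where the variance reduction comes from.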

Components

Actor (policy network): Parameterizes and samples the policy π(a|s)

A neural network producing the action distribution (categorical for discrete, Gaussian/squashed for continuous). Updated by a policy gradient weighted by the critic signal.
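For the continuous case, a squashed Gaussian head can be sketched as follows. This assumes a one-dimensional action, with `mu` and `log_std` standing in for the outputs of a policy network; the function name and the small `eps` constant are illustrative choices, not part of any particular library.

```python
import math
import random

def sample_squashed_gaussian(mu, log_std, rng=random, eps=1e-6):
    """Sample a = tanh(u), u ~ N(mu, std^2), and return (a, log pi(a))."""
    std = math.exp(log_std)
    u = rng.gauss(mu, std)      # pre-squash Gaussian sample
    a = math.tanh(u)            # squashed action, bounded in (-1, 1)
    # Log-density of u under the Gaussian.
    log_prob_u = (-0.5 * ((u - mu) / std) ** 2
                  - log_std
                  - 0.5 * math.log(2.0 * math.pi))
    # Change of variables: log pi(a) = log N(u) - log(1 - tanh(u)^2).
    log_prob_a = log_prob_u - math.log(1.0 - a * a + eps)
    return a, log_prob_a

# Example draw with a fixed seed (values are not asserted here).
action, log_prob = sample_squashed_gaussian(0.0, 0.0, rng=random.Random(0))
```

The correction term matters because the policy gradient uses the log-probability of the action actually executed, not of the pre-squash sample.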

Critic (value network): Estimates the value function V(s) / Q(s,a) / A(s,a)

A network learning to evaluate states or state-action pairs, used to compute the TD error and reduce variance of the actor update.

State-value critic V(s): Used in A2C/A3C/PPO with advantage estimation (GAE).

Action-value critic Q(s,a): Used in DDPG/TD3/SAC for continuous control.

Twin critics (TD3, SAC): Two Q critics whose minimum is used to reduce value overestimation.

Advantage estimator: Computes the advantage A(s,a) that drives the actor

A mechanism for computing the advantage: one-step TD error, n-step, or Generalized Advantage Estimation (GAE) with parameter λ.
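A minimal sketch of GAE over one rollout, assuming plain Python lists for rewards, critic values, and episode-termination flags; the function name and default values of γ and λ are illustrative. Setting λ = 0 recovers the one-step TD error, and λ = 1 recovers the Monte Carlo return minus the value baseline.

```python
def gae_advantages(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation, computed backwards over a rollout.

    rewards[t], values[t], dones[t] describe step t; last_value is the
    critic's estimate for the state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error at step t (no bootstrap across episode ends).
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

# lam=0: one-step TD errors; lam=1: Monte Carlo return minus V(s).
td_only = gae_advantages([1.0, 1.0], [0.5, 0.5], 0.0, [False, True], gamma=1.0, lam=0.0)
monte_carlo = gae_advantages([1.0, 1.0], [0.5, 0.5], 0.0, [False, True], gamma=1.0, lam=1.0)
```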

Implementation

Reference implementations

Stable-Baselines3 (A2C/PPO/SAC/TD3)

Implementation pitfalls

Critic learning instability · High

Bootstrapping with function approximation can diverge (the "deadly triad": function approximation + bootstrapping + off-policy learning).

Fix: Target networks, twin critics, policy-step limits (PPO clip, TRPO), advantage normalization.
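Two of these fixes are small enough to sketch directly: the PPO clipped surrogate for a single sample, and a soft (Polyak) target-network update. Both functions are illustrative and operate on plain floats and lists; the names, `clip_eps=0.2`, and `tau=0.005` are assumed defaults for the example rather than values taken from this page.

```python
import math

def ppo_clip_objective(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate (to be maximized)."""
    ratio = math.exp(log_prob_new - log_prob_old)   # pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # The pessimistic minimum removes the incentive to move the ratio
    # outside [1 - clip_eps, 1 + clip_eps].
    return min(ratio * advantage, clipped * advantage)

def polyak_update(target_params, online_params, tau=0.005):
    """Soft target-network update: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

# A ratio of 1.5 with a positive advantage is capped at 1 + clip_eps.
capped = ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0)
```

The slowly moving target parameters keep the critic's regression target from chasing its own updates, which is one way to tame the bootstrapping instability described above.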

Value overestimation · High

A single Q critic tends to overestimate action values, corrupting the policy.

Fix: Clipped double-Q (TD3/SAC), i.e. the minimum of two critics.
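The clipped double-Q target reduces to a one-line change in the Bellman backup. The sketch below assumes scalar next-state estimates from two target critics; the function name is hypothetical, and the SAC-specific entropy term is deliberately left out.

```python
def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """TD target using the smaller of two critics' next-state estimates."""
    min_q = min(q1_next, q2_next)   # pessimistic estimate counters overestimation
    return reward + (0.0 if done else gamma * min_q)

# Both critics are regressed toward the same target.
y = clipped_double_q_target(reward=1.0, done=False, q1_next=10.0, q2_next=8.0, gamma=0.5)
```

Because an overestimate must appear in both critics at once to survive the minimum, upward errors propagate through bootstrapping less readily than with a single critic.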

Exploration collapse · Medium

The actor may prematurely collapse to a deterministic, suboptimal policy.

Fix: Entropy bonus (SAC), exploration noise (DDPG/TD3), tuning the entropy coefficient.
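As a rough sketch of the entropy-bonus idea for a discrete policy: the entropy of the action distribution is subtracted from the actor loss, scaled by a coefficient. This is the simple additive form, not SAC's full maximum-entropy objective, and `ent_coef` and both function names are assumptions for illustration.

```python
import math

def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def actor_loss_with_entropy(log_prob, advantage, probs, ent_coef=0.01):
    """Policy-gradient loss minus an entropy bonus (lower is better)."""
    pg_loss = -log_prob * advantage
    return pg_loss - ent_coef * categorical_entropy(probs)

# With the same policy-gradient term, a more uniform policy gets a lower loss.
uniform_loss = actor_loss_with_entropy(-0.7, 1.0, [0.5, 0.5], ent_coef=0.1)
peaked_loss = actor_loss_with_entropy(-0.7, 1.0, [1.0, 0.0], ent_coef=0.1)
```

A larger `ent_coef` keeps the policy stochastic for longer; too large a value prevents it from ever committing to good actions, which is why the coefficient usually needs tuning.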

Reward-scale sensitivity · Medium

Critic and advantage updates are sensitive to the scale and variance of rewards.

Fix: Return/advantage normalization, reward clipping, symlog (DreamerV3).
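Two of these fixes, sketched with the standard library only: per-batch advantage normalization, and the symlog transform with its inverse. The function names and the `eps` constant are illustrative assumptions.

```python
import math

def normalize(xs, eps=1e-8):
    """Shift a batch to zero mean and scale it to (approximately) unit std."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    std = math.sqrt(var)
    return [(x - mean) / (std + eps) for x in xs]

def symlog(x):
    """Sign-preserving log compression: sign(x) * ln(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)

def symexp(y):
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)

# Rewards of very different magnitude end up on a comparable scale.
compressed = [symlog(r) for r in (-1000.0, -1.0, 0.0, 1.0, 1000.0)]
```

Normalization rescales each batch relative to itself, while symlog is a fixed transform that compresses large magnitudes and stays close to the identity near zero.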

Evolution

Original paper · 1983 · IEEE Transactions on Systems, Man, and Cybernetics 1983 · Andrew G. Barto

Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems

Andrew G. Barto, Richard S. Sutton, Charles W. Anderson

1983

Barto-Sutton-Anderson — first actor-critic

Inflection point

Formulation of an "adaptive critic element" + "associative search element" solving the pole-balancing problem.

2000

Policy Gradient Theorem and function-approximation actor-critic

Inflection point

Sutton et al. formalize the policy gradient theorem, providing theoretical foundations for modern actor-critic methods.

Policy Gradient Methods for Reinforcement Learning with Function Approximation (paper)

2015

DDPG — off-policy continuous control

Lillicrap et al. combine a deterministic actor with a Q critic for continuous action spaces.

Continuous Control with Deep Reinforcement Learning (paper)

2016

A3C — asynchronous deep actor-critic

Inflection point

Mnih et al. introduce A3C, demonstrating scalable, stable deep RL without a replay buffer.

Asynchronous Methods for Deep Reinforcement Learning (paper)

2017

PPO — clipped actor-critic

Inflection point

Schulman et al. introduce PPO, which became one of the most widely used actor-critic variants and later the backbone of RLHF.

PPO (concept) · Proximal Policy Optimization Algorithms (paper)

2018

SAC — maximum-entropy actor-critic

Haarnoja et al. add entropy regularization and twin critics, achieving state-of-the-art results on continuous-control benchmarks at the time.

Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL (paper)

Sources

Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems (Barto-Sutton-Anderson)

Paper

IEEE TSMC 1983

Policy Gradient Methods for Reinforcement Learning with Function Approximation

Paper

NeurIPS 1999

Asynchronous Methods for Deep Reinforcement Learning (A3C)

Paper

arXiv / ICML 2016

Continuous Control with Deep Reinforcement Learning (DDPG)

Paper

arXiv / ICLR 2016

Proximal Policy Optimization Algorithms (PPO)

Paper

arXiv

Soft Actor-Critic (SAC)

Paper

arXiv / ICML 2018

Sutton & Barto, Reinforcement Learning: An Introduction (Ch. 13)

Documentation

MIT Press

Hyperparameters (configurable axes)

Execution paradigm

Parallelism

Hardware requirements