At each step: (1) The actor π_θ(a|s) selects an action based on the current state s. (2) The environment returns reward r and next state s'. (3) The critic V_w(s) (or Q_w) computes the temporal-difference (TD) error δ = r + γV_w(s') − V_w(s), a one-sample estimate of the advantage. (4) The critic is updated by regression to minimize the squared TD error. (5) The actor is updated with a policy gradient weighted by δ: ∇_θ log π_θ(a|s)·δ, increasing the probability of better-than-expected actions. Variants differ in advantage estimation (GAE), number of bootstrap steps (n-step returns), entropy regularization (SAC), or objective clipping (PPO).
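The five steps above can be sketched for a tabular softmax actor and a tabular critic. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `actor_critic_step`, the table layout (`theta[s][a]` for action preferences, `V[s]` for state values), and the learning rates are hypothetical choices made for the example.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, V, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.1, lr_critic=0.5):
    """Apply one one-step actor-critic update in place and return the TD error."""
    # Step (3): TD error, with no bootstrapping past a terminal state.
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]

    # Step (4): critic moves toward the bootstrapped target (semi-gradient step).
    V[s] += lr_critic * delta

    # Step (5): actor update. For a softmax policy, the gradient of
    # log pi(a|s) with respect to theta[s][b] is 1[b == a] - pi(b|s).
    pi = softmax(theta[s])
    for b in range(len(pi)):
        grad_log = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += lr_actor * delta * grad_log
    return delta


# Two states, two actions; one transition with a positive surprise.
theta = [[0.0, 0.0], [0.0, 0.0]]
V = [0.0, 0.0]
delta = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1, done=False)
print(delta, V[0], softmax(theta[0]))
```

Because δ is positive here, the probability of the chosen action rises and the value estimate of the visited state moves toward the target.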
Pure policy-gradient methods (REINFORCE) suffer from high estimator variance, which slows and destabilizes learning. Pure value-based methods (Q-learning) are hard to apply in continuous action spaces, because acting greedily requires maximizing Q over all actions. Actor-Critic combines the strengths of both: lower variance via the critic and direct policy parameterization for continuous actions.
A neural network producing the action distribution (categorical for discrete actions; Gaussian, often tanh-squashed, for continuous actions). Updated by a policy gradient weighted by the critic signal.
A network learning to evaluate states or state-action pairs, used to compute the TD error and reduce variance of the actor update.
A mechanism for computing the advantage: one-step TD error, n-step, or Generalized Advantage Estimation (GAE) with parameter λ.
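The advantage estimators named above can be written as one backward recursion over a rollout. The sketch below assumes a single trajectory stored as plain lists; the function name `gae` and its argument names are hypothetical. Setting `lam=0` reduces it to the one-step TD error, and `lam=1` gives the discounted return (bootstrapped from `last_value`) minus the value baseline.

```python
def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout.

    rewards[t], values[t], dones[t] describe step t; last_value is the
    critic's estimate for the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error at step t (no bootstrap across episode ends).
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
dones = [False, False, False]
adv_td = gae(rewards, values, 0.5, dones, gamma=0.9, lam=0.0)  # one-step TD
adv_mc = gae(rewards, values, 0.5, dones, gamma=0.9, lam=1.0)  # full return
print(adv_td, adv_mc)
```

Intermediate values of `lam` interpolate between the low-variance, higher-bias one-step estimate and the high-variance, lower-bias full-return estimate.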
Bootstrapping with function approximation can diverge (the deadly triad: function approximation, bootstrapping, and off-policy learning combined).
A single Q critic tends to overestimate action values, corrupting the policy.
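One common remedy, used by twin-critic methods such as SAC, is to bootstrap from the smaller of two independently trained Q estimates. The sketch below shows only the target computation under that assumption; the function name and the scalar inputs are hypothetical, and SAC's entropy term is deliberately left out.

```python
def clipped_double_q_target(r, q1_next, q2_next, done, gamma=0.99):
    """TD target that bootstraps from the minimum of two critics.

    q1_next and q2_next are the two critics' estimates of Q(s', a').
    Taking the minimum counteracts the upward bias of a single critic.
    """
    min_q = min(q1_next, q2_next)
    return r + (0.0 if done else gamma * min_q)


# The critics disagree; the more pessimistic estimate is used.
target = clipped_double_q_target(r=1.0, q1_next=5.0, q2_next=3.0,
                                 done=False, gamma=0.9)
terminal_target = clipped_double_q_target(r=1.0, q1_next=5.0, q2_next=3.0,
                                          done=True, gamma=0.9)
print(target, terminal_target)
```

Both critics would then be regressed toward this shared target.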
The actor may prematurely collapse to a deterministic, suboptimal policy.
Critic and advantage updates are sensitive to the scale and variance of rewards.
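A widely used mitigation for this scale sensitivity is to standardize advantages within each batch before the actor update. The helper below is a minimal sketch of that idea; the name `normalize_advantages` and the `eps` default are assumptions, and it does not address the scale of the critic's regression targets.

```python
import math


def normalize_advantages(advantages, eps=1e-8):
    """Rescale a batch of advantages to zero mean and unit variance."""
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = math.sqrt(var)
    # eps guards against division by zero when all advantages are equal.
    return [(a - mean) / (std + eps) for a in advantages]


small = normalize_advantages([1.0, 2.0, 3.0])
large = normalize_advantages([100.0, 200.0, 300.0])
print(small, large)
```

After normalization, rewards that differ only by a constant scale factor produce nearly identical actor gradients.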
Barto, Sutton, and Anderson formulate an "adaptive critic element" paired with an "associative search element" to solve the pole-balancing problem.
Sutton et al. formalize the policy gradient theorem, providing theoretical foundations for modern actor-critic methods.
Mnih et al. introduce A3C, demonstrating scalable, stable deep RL without a replay buffer.
Lillicrap et al. introduce DDPG, combining a deterministic actor with a Q critic for continuous action spaces.
Schulman et al. introduce PPO, which became one of the most widely used actor-critic variants and later a common backbone of RLHF.
Haarnoja et al. introduce Soft Actor-Critic (SAC), adding entropy regularization and twin critics and reaching state-of-the-art results in continuous control at the time.
V(s) vs Q(s,a) vs twin Q critics: this choice determines the algorithm family (A2C vs DDPG/SAC).
Bias-variance trade-off parameter λ in Generalized Advantage Estimation (typically 0.9 to 1.0, with 0.95 a common default).
Weight of the entropy bonus encouraging exploration (key in SAC, A3C, PPO).
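To make the role of this weight concrete, the sketch below computes the entropy of a categorical policy and subtracts a scaled entropy bonus from a per-sample actor loss. The function names and the `entropy_coef` default of 0.01 are assumptions for illustration; SAC folds entropy into the return itself rather than adding it as a separate loss term.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def actor_loss_with_entropy(log_prob, advantage, probs, entropy_coef=0.01):
    """Per-sample actor loss to minimize.

    The first term is the negated policy-gradient objective; the second
    rewards higher-entropy (more exploratory) action distributions.
    """
    pg_loss = -(log_prob * advantage)
    return pg_loss - entropy_coef * categorical_entropy(probs)


uniform_entropy = categorical_entropy([0.5, 0.5])
collapsed_entropy = categorical_entropy([1.0, 0.0])
loss_spread = actor_loss_with_entropy(-0.7, 1.0, [0.5, 0.5], entropy_coef=0.1)
loss_peaked = actor_loss_with_entropy(-0.7, 1.0, [0.99, 0.01], entropy_coef=0.1)
print(uniform_entropy, collapsed_entropy, loss_spread, loss_peaked)
```

With the policy-gradient term held fixed, the more spread-out distribution yields the lower loss, which is how the bonus pushes back against premature collapse to a deterministic policy.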
Relative learning rates of actor and critic; the critic is usually given the higher learning rate (or more updates per step) for stability.
Number of parallel environments/actors collecting data (A2C/A3C/IMPALA).
Both the actor and critic networks are active at every training step.
Asynchronous (A3C), synchronous (A2C), and distributed (IMPALA) variants parallelize data collection across many actors. Gradient updates are asynchronous in A3C and synchronous in A2C; within each environment, steps remain sequential.
Training the actor and critic networks with batched updates benefits from GPUs, although network sizes are moderate.
Data collection by many parallel actors (A3C/IMPALA) is often CPU-bound in simulation.