Training

Reinforcement Learning (RL)

1998 · Active · Published: 30 May 2026 · Updated: 30 May 2026

Key innovation

Formalised learning through environment interaction: an agent maximises cumulative reward via trial and error, without supervision in the form of labelled examples.

How it works

In state s the agent selects action a according to policy π(a|s); the environment returns reward r and a new state s'. This cycle produces a trajectory (s₀, a₀, r₀, s₁, a₁, r₁, …). The agent estimates the state value V^π(s) or the action value Q^π(s,a), which satisfies the Bellman equation Q^π(s,a) = E[r + γ · Q^π(s', a')], where γ ∈ [0, 1] is the discount factor and the next action a' is drawn from π(·|s').

RL algorithms fall into four broad families: (1) value-based (Q-learning, DQN), which learn Q and select actions via argmax; (2) policy-gradient (REINFORCE, PPO, TRPO), which optimise π directly via the gradient of expected return; (3) actor-critic (A3C, SAC, DDPG), which combine both; and (4) model-based (Dyna, MuZero, Dreamer), which explicitly learn a dynamics model.

Central challenges include the exploration–exploitation dilemma and credit assignment (attributing a delayed reward to the earlier actions that caused it).
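The interaction loop and the value-based family can be illustrated with tabular Q-learning. The following is a minimal sketch, not a reference implementation: the five-state chain environment, the `step` and `train` helpers, and the hyperparameter values are all assumptions made up for this example.

```python
import random

N_STATES = 5          # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.2   # illustrative values, not tuned

def step(state, action):
    """Toy deterministic dynamics: reward 1.0 only on reaching the last state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[s][a] table
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # epsilon-greedy: explore with probability EPSILON, otherwise
            # act greedily, breaking ties between equal Q-values at random
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Q-learning target: r + gamma * max_a' Q(s', a'); no bootstrap at terminal
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q

q_table = train()
# Greedy policy for the non-terminal states (selecting actions via argmax)
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

In this toy setting the greedy policy should end up moving right in every state, since that is the only way to reach the rewarding terminal state. Note that Q-learning bootstraps from the maximum over next actions rather than from an action sampled from π, which is what makes it off-policy.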

Problem solved

How to teach an agent to make sequential decisions in an environment where there are no labelled examples of "correct action" and the learning signal is delayed, sparse, and only partially informative (a scalar reward).

Components

Agent: Action selection and policy update

The decision maker — learns the policy π(a|s) and selects actions based on state observations.

Environment: Generates observations and rewards

The world the agent interacts with. Defines transition dynamics P(s'|s,a) and reward function R(s,a).

Policy: Agent's behavioural strategy

Function π(a|s) mapping a state to a probability distribution over actions. Can be deterministic or stochastic; tabular or parameterised by a neural network.
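The difference between a stochastic and a deterministic policy can be sketched in a few lines. This example assumes the policy is derived from per-action scores (for instance Q-values or network logits); the function names and the `temperature` parameter are illustrative choices, not part of any particular library.

```python
import math
import random

def softmax_policy(preferences, temperature=1.0):
    """Stochastic policy: turn per-action scores into probabilities pi(a|s)."""
    scaled = [p / temperature for p in preferences]
    m = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    """Draw one action index according to the distribution pi(.|s)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def greedy_action(preferences):
    """Deterministic policy: always pick the highest-scoring action."""
    return max(range(len(preferences)), key=lambda a: preferences[a])

scores = [1.0, 2.0, 0.5]                      # hypothetical scores for three actions
probs = softmax_policy(scores)
action = sample_action(probs, random.Random(0))
```

Lowering `temperature` concentrates probability on the best-scoring action, so the stochastic policy approaches the greedy one.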

Value function: Long-term consequence estimation

V^π(s) or Q^π(s,a) — expected cumulative discounted reward from a given state (or state–action pair) when following policy π.
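For a small, fully known MDP these quantities can be computed by repeatedly applying the Bellman expectation backup (iterative policy evaluation). The sketch below assumes a made-up two-state, two-action MDP and a uniform random policy; the tables `P`, `R`, `PI` and the helper names are hypothetical.

```python
GAMMA = 0.9   # illustrative discount factor

# Hypothetical MDP: P[s][a] is a list of (probability, next_state) pairs
P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 0)], 1: [(1.0, 1)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}   # R[s][a]
PI = {0: [0.5, 0.5], 1: [0.5, 0.5]}              # pi(a|s), uniform random

def q_from_v(v, s, a):
    """Q^pi(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * V^pi(s')."""
    return R[s][a] + GAMMA * sum(p * v[s2] for p, s2 in P[s][a])

def evaluate_policy(tol=1e-10):
    """Iterate V(s) <- sum_a pi(a|s) * Q(s,a) until the values stop changing."""
    v = {s: 0.0 for s in P}
    while True:
        new_v = {s: sum(PI[s][a] * q_from_v(v, s, a) for a in P[s]) for s in P}
        if max(abs(new_v[s] - v[s]) for s in P) < tol:
            return new_v
        v = new_v

v_pi = evaluate_policy()
q_pi = {s: {a: q_from_v(v_pi, s, a) for a in P[s]} for s in P}
```

In this example state 1 ends up with the higher value, because the only rewarding action is available there.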

Reward function: Goal specification

R(s,a): the scalar learning signal returned by the environment. It defines the agent's goal: everything RL optimises is an expected (usually discounted) sum of rewards.
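The sum of rewards being optimised is usually the discounted return G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + …. A minimal sketch of computing it for every step of a finished episode, assuming a discount factor `gamma` and a hypothetical helper name:

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t for every step t, using the recursion G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):   # walk the episode backwards
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A sparse-reward episode: the only reward arrives at the final step
episode_returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
# → [0.25, 0.5, 1.0]
```

Working backwards keeps the computation linear in the episode length, and it shows how a single late reward is propagated, with discounting, to the earlier steps.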

Implementation

Reference implementations

Python (PyTorch / TensorFlow)

Implementation pitfalls

Reward hacking / specification gaming (severity: High)

The agent finds a way to maximise reward that does not match designer intent — e.g. by exploiting a loophole in the reward function instead of solving the task.

Fix: Careful reward design, reward shaping, RLHF, constrained RL, regular validation of behaviour across diverse conditions.

Training instability (severity: Critical)

Deep RL is notoriously unstable: small changes in hyperparameters or random seeds can yield drastically different results, and drifting value estimates can cause training to diverge.

Fix: Target networks (DQN), trust regions (TRPO/PPO), observation/reward normalisation, gradient clipping, multiple seeds.
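As one concrete instance of the trust-region idea, PPO replaces TRPO's hard constraint with a clipped surrogate objective that removes the incentive to move the new policy far from the old one in a single update. The sketch below computes that objective in plain Python for a batch of samples; the function name and the default `clip_eps=0.2` are assumptions for illustration, and a real implementation would compute this on tensors with automatic differentiation.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over the batch."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                       # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # The pessimistic (smaller) of the two terms bounds the policy change
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Ratio 2.0 with a positive advantage: the gain is capped at (1 + eps) * A
capped = ppo_clip_objective([math.log(2.0)], [0.0], [1.0])
# Ratio 0.5 with a negative advantage: the penalty is floored at (1 - eps) * A
floored = ppo_clip_objective([math.log(0.5)], [0.0], [-1.0])
```

The objective is maximised during training; once the probability ratio leaves the interval [1 − ε, 1 + ε] in the direction favoured by the advantage, the clipped term is constant and contributes no further gradient.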

Sample inefficiency (severity: High)

Deep RL often requires a huge number of environment interactions (millions to billions of steps), which makes it impractical to train directly on real physical systems without simulation.

Fix: Model-based RL (MuZero, Dreamer), offline RL, pre-training on demonstrations (imitation learning + RL fine-tuning), sim-to-real transfer.

Catastrophic forgetting (severity: Medium)

An agent learning new tasks can forget previously mastered skills, especially in continual / multi-task RL setups.

Fix: Elastic weight consolidation (EWC), task replay buffers, modular policies, multi-task curricula.

Exploration in sparse-reward environments (severity: High)

When reward is sparse (e.g. only at episode end), naive random exploration fails to discover solutions. This is a fundamental problem for long-horizon tasks.

Fix: Intrinsic motivation (curiosity, random network distillation (RND)), hierarchical RL, reward shaping, expert demonstrations, goal-conditioned RL.
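The simplest form of intrinsic motivation is a count-based novelty bonus, used here as a stand-in for curiosity or RND, which replace explicit counts with learned novelty estimates. The sketch assumes hashable (for example discrete) states; the `CountBonus` class and the scale `beta` are hypothetical names for this example.

```python
import math
from collections import defaultdict

class CountBonus:
    """Intrinsic reward beta / sqrt(N(s)), where N(s) counts visits to state s."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def __call__(self, state):
        self.counts[state] += 1               # count this visit first
        return self.beta / math.sqrt(self.counts[state])

bonus = CountBonus(beta=1.0)
first_visit = bonus("room_a")                 # novel state: full bonus
for _ in range(2):
    bonus("room_a")
fourth_visit = bonus("room_a")                # familiar state: bonus has decayed
```

During training the bonus would be added to the environment reward, so that rarely visited states look temporarily more attractive even when the extrinsic reward is zero.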

Evolution

Original publication · 1998 · MIT Press (1st ed. 1998, 2nd ed. 2018)

Reinforcement Learning: An Introduction

Richard S. Sutton, Andrew G. Barto

1957

Bellman equations

Inflection point

Richard Bellman formulates the mathematical foundations of dynamic programming and discounted reward — the theoretical foundation of RL.

1989

Q-learning (Watkins)

Inflection point

Chris Watkins introduces Q-learning — a model-free, off-policy, tabular algorithm for learning action-value functions.

1998

Sutton & Barto: Reinforcement Learning: An Introduction

Inflection point

First canonical summary of the field — defines the terminology and taxonomy still in use today.

2013

DQN — Deep Q-Network (DeepMind)

Inflection point

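Mnih et al. combine Q-learning with a convolutional neural network and learn to play Atari 2600 games directly from raw pixels, outperforming previous approaches on six of the seven games tested and surpassing a human expert on three. This marks the start of the deep RL era.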

Playing Atari with Deep Reinforcement Learning (paper)

2016

AlphaGo defeats Lee Sedol

Inflection point

DeepMind combines Monte Carlo tree search (MCTS), RL and deep learning to defeat Lee Sedol, one of the world's strongest Go players, 4–1. Many experts had expected this milestone to be at least a decade away.

2017

PPO — Proximal Policy Optimization

Inflection point

OpenAI publishes PPO, a simple and comparatively stable policy-gradient algorithm that becomes a de facto industry standard (later used in RLHF for GPT models).

Proximal Policy Optimization Algorithms (paper)

2019

MuZero

DeepMind presents MuZero, a model-based RL algorithm that plans with a learned dynamics model without being given the rules of the environment. It matches AlphaZero in Go, chess and shogi and sets a new state of the art on Atari.

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (paper)

2022

RLHF in ChatGPT

Inflection point

OpenAI uses RLHF (with PPO) to align the GPT-3.5 models behind ChatGPT with human preferences; GPT-4 follows in 2023. RL enters the mainstream of consumer AI products used by millions of people.

2024

RL for reasoning (o1, DeepSeek-R1)

Inflection point

OpenAI o1 (2024) and DeepSeek-R1 (January 2025) use large-scale RL to learn long, step-by-step reasoning (chain-of-thought); DeepSeek-R1 reports training on verifiable rewards for math and code. RL becomes a core mechanism of reasoning models.