In state s the agent selects action a according to policy π(a|s); the environment returns reward r and a new state s'. This cycle produces a trajectory (s₀, a₀, r₀, s₁, a₁, r₁, …). The agent estimates the state value V^π(s) or the action value Q^π(s,a), which satisfies the Bellman equation Q^π(s,a) = E[r + γ · Q^π(s', a')], where the expectation is taken over the next state s' and the next action a' ~ π(·|s'). RL algorithms split into four families: (1) value-based (Q-learning, DQN), which learn Q and select actions via argmax; (2) policy-gradient (REINFORCE, PPO, TRPO), which optimise π directly via the gradient of expected return; (3) actor-critic (A3C, SAC, DDPG), which combine a learned policy with a learned value function; (4) model-based (Dyna, MuZero, Dreamer), which explicitly learn a dynamics model. Central challenges include the exploration–exploitation dilemma and credit assignment (attributing a delayed reward to the earlier actions that caused it).
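The Bellman equation above can be turned into an incremental update rule. The following is a minimal sketch of one tabular Q-learning step, not a full training loop. Note that Q-learning bootstraps from the maximum over next actions rather than from the expectation under π, so it targets the optimal Q rather than Q^π. The function name, the state and action labels, and the `alpha` (step size) and `done` arguments are assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrap from the greedy value of the next state (zero if the episode ended).
    next_value = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    td_target = r + gamma * next_value
    td_error = td_target - Q[(s, a)]
    # Move the current estimate a fraction alpha towards the target.
    Q[(s, a)] += alpha * td_error
    return td_error


# Hypothetical two-action problem; unseen (state, action) pairs default to 0.
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 2.0

delta = q_learning_update(
    Q, "s0", "right", r=1.0, s_next="s1", actions=actions, alpha=0.5, gamma=0.9
)
print(delta, Q[("s0", "right")])
```

With these numbers the target is 1.0 + 0.9 · 2.0 = 2.8, so the estimate for ("s0", "right") moves halfway from 0 towards 2.8.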
How to teach an agent to make sequential decisions in an environment where there are no labelled examples of "correct action" and the learning signal is delayed, sparse, and only partially informative (a scalar reward).
The decision maker — learns the policy π(a|s) and selects actions based on state observations.
The world the agent interacts with. Defines transition dynamics P(s'|s,a) and reward function R(s,a).
Function π(a|s) mapping a state to a probability distribution over actions. Can be deterministic or stochastic; tabular or parameterised by a neural network.
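As an illustration of the stochastic versus deterministic distinction, here is a minimal sketch of a tabular softmax policy for a single state. The softmax parameterisation, the function names and the action labels are assumptions chosen for the example; the text does not prescribe any particular form of π.

```python
import math
import random


def softmax_policy(preferences):
    """Turn per-action preferences (logits) into probabilities pi(a|s)."""
    m = max(preferences.values())  # subtract the max for numerical stability
    exps = {a: math.exp(h - m) for a, h in preferences.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def sample_action(probs, rng=random):
    """Stochastic policy: draw an action according to pi(a|s)."""
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


def greedy_action(probs):
    """Deterministic policy: always pick the most probable action."""
    return max(probs, key=probs.get)


# Hypothetical preferences for one state with two actions.
prefs = {"left": 0.0, "right": 1.0}
probs = softmax_policy(prefs)
print(probs, sample_action(probs), greedy_action(probs))
```

In a neural-network policy the preferences would be the network's output logits for the observed state; the sampling and argmax steps stay the same.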
V^π(s) or Q^π(s,a) — expected cumulative discounted reward from a given state (or state–action pair) when following policy π.
R(s,a) — scalar learning signal returned by the environment. Defines the agent's goal — everything RL optimises is a sum of rewards.
The agent finds a way to maximise reward that does not match designer intent — e.g. by exploiting a loophole in the reward function instead of solving the task.
Deep RL is notoriously unstable: small changes in hyperparameters or random seeds can yield drastically different results. Bootstrapped value estimates can drift and, when combined with function approximation and off-policy data, diverge.
RL requires huge numbers of environment interactions (millions–billions of steps), making it impractical for real physical systems without simulation.
An agent learning new tasks can forget previously mastered skills, especially in continual / multi-task RL setups.
When reward is sparse (e.g. only at episode end), naive random exploration fails to discover solutions. Fundamental problem for long-horizon tasks.
Richard Bellman formulates the mathematical foundations of dynamic programming and discounted reward — the theoretical foundation of RL.
Chris Watkins introduces Q-learning — a model-free, off-policy, tabular algorithm for learning action-value functions.
Sutton and Barto's textbook "Reinforcement Learning: An Introduction" gives the first canonical summary of the field and defines the terminology and taxonomy still in use today.
Mnih et al. combine Q-learning with a convolutional neural network and achieve superhuman performance on Atari games from raw pixels — the start of the Deep RL era.
DeepMind's AlphaGo combines MCTS, RL and deep learning and defeats Lee Sedol, one of the world's strongest Go players, a milestone many had expected to be a decade or more away.
OpenAI publishes PPO — a simple, stable policy-gradient algorithm that becomes the de-facto industry standard (later used in RLHF for GPT).
DeepMind presents MuZero, a model-based RL agent that learns a model of the environment's dynamics without being given the rules. It matches AlphaZero's superhuman play in Go, chess and shogi and sets a new state of the art on Atari.
OpenAI uses RLHF (with PPO) to align GPT-3.5/4 with human preferences — RL enters the mainstream of consumer AI products for millions of users.
OpenAI o1 and DeepSeek-R1 use RL on verifiable rewards (math, code) to learn long, step-by-step reasoning (chain-of-thought) — RL becomes the core mechanism of reasoning models.
Weight of future rewards versus immediate ones. γ ∈ [0,1]. Values near 1 favour long-horizon planning; near 0 favour short-term behaviour.
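The effect of γ is easy to see on a single delayed reward. The sketch below computes the discounted return G = r₀ + γ·r₁ + γ²·r₂ + … for a hypothetical four-step episode; the function name and the reward sequence are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Return the sum of gamma**t * rewards[t] over the episode."""
    g = 0.0
    # Work backwards so each step is a single multiply-add: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: the only reward arrives at the last step.
rewards = [0.0, 0.0, 0.0, 1.0]

for gamma in (0.0, 0.5, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With γ = 0 the delayed reward is invisible from the first step, while with γ = 1 it counts in full; intermediate values shrink it by γ³.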
Step size for updating policy or value function parameters. Too high → instability; too low → slow convergence.
Controls exploration–exploitation tradeoff. ε-greedy picks a random action with probability ε; policy-gradient methods use entropy regularisation.
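A minimal sketch of ε-greedy action selection over a list of Q-value estimates follows. The function name and the example values are assumptions; entropy regularisation for policy-gradient methods is not shown.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest estimated Q-value (first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-value estimates for three actions in one state.
q_values = [0.1, 0.5, 0.2]
action = epsilon_greedy(q_values, epsilon=0.0)  # epsilon = 0 is purely greedy
print(action)
```

In practice ε is often annealed from a high value towards a small one over training, so that early behaviour is exploratory and later behaviour is mostly greedy.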
Number of stored transitions (s, a, r, s') used for off-policy training. Larger buffer → more stable training, higher memory cost.
Number of transitions sampled from the replay buffer per update.
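Buffer size and batch size can be seen together in a small sketch of a uniform replay buffer. The class and method names are hypothetical, and the extra `done` flag stored with each (s, a, r, s') transition is an assumption that many implementations make so that terminal states are not bootstrapped from.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest item once full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement; requires batch_size <= len(self).
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Hypothetical usage: capacity 3, five transitions added, batch of 2 sampled.
buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.add(s=i, a=0, r=0.0, s_next=i + 1, done=False)

batch = buf.sample(batch_size=2)
print(len(buf), batch)
```

Only the three most recent transitions survive, which is the memory-versus-diversity tradeoff the buffer size controls.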
RL is not a single computational architecture but a training paradigm — execution depends on the specific algorithm (DQN, PPO, SAC, MuZero) and the neural network architecture approximating policy/value.
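Interaction with a single environment is inherently sequential (rollouts), which limits parallelism. Distributed architectures such as A3C, IMPALA and Ape-X parallelise data collection across many actors, often on many machines. A3C applies gradients asynchronously from its workers, whereas IMPALA and Ape-X feed a centralised learner, which must then correct for or tolerate stale, off-policy data.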
Deep RL uses neural networks trained via backpropagation; GPUs are well suited to the matrix multiplications that dominate policy and Q-function updates.
Environment simulation (MuJoCo, Atari, games) is often CPU-bound; distributed setups (IMPALA, Ape-X) use many CPU actors + a GPU learner.
TPUs used by DeepMind for large-scale experiments (AlphaGo, AlphaZero, MuZero); good for synchronous RL workloads with large batch sizes.