Architecture

Markov Decision Process (MDP)

1957 | Active | Published: 30 May 2026 | Updated: 30 May 2026

Key innovation

Formalisation of sequential decision-making under uncertainty as a tuple (S, A, P, R, γ) with the Markov property — the theoretical foundation of all Reinforcement Learning.

How it works

At each step t the agent observes state s_t ∈ S and selects action a_t ∈ A according to policy π(a|s); the environment then transitions to s_{t+1} ~ P(·|s_t, a_t) and returns reward r_t = R(s_t, a_t). The goal is to find a policy π* that maximises the value function V^π(s) = E[Σ_{t=0}^∞ γ^t · r_t | s_0 = s, π] for every state s. The optimal value function satisfies the Bellman optimality equation: V*(s) = max_a [R(s,a) + γ Σ_{s'} P(s'|s,a) V*(s')].

When P and R are known, MDPs are solved by Value Iteration (repeated application of the Bellman optimality operator), Policy Iteration (alternating policy evaluation and policy improvement), or linear programming. When P and R are unknown, RL algorithms learn from sampled trajectories instead; model-free methods such as Q-learning, SARSA, and policy gradients do so without estimating P and R explicitly. For discounted MDPs with finite S and A, the Markov property ensures that at least one optimal policy is stationary and deterministic, so the search can be restricted to mappings from states to actions.
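The Bellman optimality update can be illustrated on a tiny tabular problem. The following is a minimal sketch of Value Iteration, not a reference implementation: the two-state MDP, the nested-list layout `P[s][a][s2]` and `R[s][a]`, and the helper names `q_value` and `value_iteration` are all assumptions made for the example.

```python
# Hypothetical toy MDP (not from the source): 2 states, 2 actions.
# P[s][a][s2] = P(s2 | s, a); R[s][a] = immediate reward R(s, a).
P = [
    [[1.0, 0.0], [0.2, 0.8]],  # state 0: action 0 stays; action 1 reaches state 1 w.p. 0.8
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays; action 1 returns to state 0
]
R = [
    [0.0, 0.0],  # no reward is available in state 0
    [1.0, 0.0],  # reward 1 for staying in state 1
]


def q_value(P, R, V, s, a, gamma):
    """One-step lookahead: R(s,a) + gamma * sum over s2 of P(s2|s,a) * V(s2)."""
    return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))


def value_iteration(P, R, gamma, tol=1e-8, max_iters=10_000):
    """Apply the Bellman optimality operator until V changes by less than tol."""
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(max_iters):
        new_V = [
            max(q_value(P, R, V, s, a, gamma) for a in range(len(P[s])))
            for s in range(n_states)
        ]
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < tol:
            break
    # Read off a deterministic greedy policy from the converged values.
    policy = [
        max(range(len(P[s])), key=lambda a: q_value(P, R, V, s, a, gamma))
        for s in range(n_states)
    ]
    return V, policy


V, policy = value_iteration(P, R, gamma=0.9)
print(V, policy)
```

For this toy problem with γ = 0.9 the values approach V*(1) = 1/(1 − 0.9) = 10 and V*(0) = 7.2/0.82 ≈ 8.78, and the greedy policy is to move towards state 1 and then stay there.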

Problem solved

How to mathematically formalise the problem of agent decision-making in a stochastic environment — in a way that allows proving the existence of an optimal policy and constructing algorithms to find it.

Components

State space (S): Representation of world situations

The set of all possible environment states. Can be discrete (finite or countable) or continuous (e.g. R^n).

Action space (A): Decision choice

The set of actions available to the agent. Can be discrete (e.g. {left, right, up, down}) or continuous (e.g. torque in robotics).

Transition function (P): Environment dynamics

P(s'|s,a) — probability of transitioning to state s' after taking action a in state s. Defines the stochastic dynamics of the environment.
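For a finite MDP, one possible way to store P is a nested table in which every row P(·|s,a) is a probability distribution. The sketch below assumes that layout (`P[s][a][s2]`); the table contents and the function names `is_valid_transition_table` and `sample_next_state` are hypothetical.

```python
import random

# Hypothetical 2-state, 2-action table: P[s][a][s2] = P(s2 | s, a).
P = [
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [1.0, 0.0]],
]


def is_valid_transition_table(P, tol=1e-9):
    """Every row P(.|s,a) must be non-negative and sum to 1."""
    return all(
        all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for actions in P
        for row in actions
    )


def sample_next_state(P, s, a, rng):
    """Draw s2 ~ P(.|s,a) using the given random.Random instance."""
    return rng.choices(range(len(P[s][a])), weights=P[s][a], k=1)[0]


rng = random.Random(0)
print(is_valid_transition_table(P))
print([sample_next_state(P, 0, 1, rng) for _ in range(10)])
```

Checking the rows up front catches a common modelling error (probabilities that do not sum to 1) before any solver is run.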

Reward function (R): Goal specification

R(s,a) or R(s,a,s'): the scalar reward returned by the environment. It defines the agent's objective: everything an MDP optimises is an expected sum of discounted rewards.
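The two reward signatures are related by an expectation over the next state: R(s,a) = Σ_{s'} P(s'|s,a) · R(s,a,s'). A minimal sketch, assuming nested-list tables `P[s][a][s2]` and `R3[s][a][s2]` (both hypothetical, as is the function name):

```python
def expected_reward(P, R3, s, a):
    """Collapse R(s,a,s2) to R(s,a) by averaging over s2 ~ P(.|s,a)."""
    return sum(p * r for p, r in zip(P[s][a], R3[s][a]))


# Hypothetical single-state, single-action example:
# the action pays 5.0 only when the transition reaches state 1 (probability 0.8).
P = [[[0.2, 0.8]]]
R3 = [[[0.0, 5.0]]]
print(expected_reward(P, R3, 0, 0))
```

Because the Bellman equation only ever uses the expectation, either form leads to the same optimal values.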

Discount factor (γ): Balancing short- and long-term rewards

γ ∈ [0,1]. Weight of future rewards versus immediate ones. With bounded rewards, γ < 1 guarantees that the discounted reward series converges over an infinite horizon; γ = 1 is normally reserved for finite-horizon or episodic tasks.
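The effect of γ on convergence can be checked numerically. The sketch below computes the discounted return of a reward sequence; the function name and the constant-reward example are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Constant reward of 1 per step over a long horizon.
rewards = [1.0] * 1000
print(discounted_return(rewards, 0.9))  # close to the limit 1 / (1 - 0.9) = 10
print(discounted_return(rewards, 1.0))  # grows with the horizon: 1000.0
```

With γ < 1 and rewards bounded by R_max, the return is bounded by R_max / (1 − γ); with γ = 1 it keeps growing as the horizon lengthens.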

Policy (π): Agent's decision strategy

π(a|s): a function mapping each state to a probability distribution over actions. The solution to an MDP is an optimal policy π*.

Deterministic policy: π(s) returns a single action.

Stochastic policy: π(a|s) returns a probability distribution.
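The difference between the two policy types is easy to see in code. This is a minimal sketch using plain dictionaries; the state and action labels and all function names are hypothetical.

```python
import random

# Deterministic policy: state -> action.
deterministic_policy = {"s0": "right", "s1": "left"}

# Stochastic policy: state -> {action: probability}.
stochastic_policy = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"left": 0.5, "right": 0.5},
}


def act_deterministic(policy, state):
    """pi(s): look up the single action for this state."""
    return policy[state]


def act_stochastic(policy, state, rng):
    """Sample a ~ pi(.|s) using the given random.Random instance."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def as_stochastic(policy, actions):
    """Embed a deterministic policy as a one-hot distribution per state."""
    return {
        s: {a: 1.0 if a == chosen else 0.0 for a in actions}
        for s, chosen in policy.items()
    }


rng = random.Random(0)
print(act_deterministic(deterministic_policy, "s0"))
print(act_stochastic(stochastic_policy, "s0", rng))
print(as_stochastic(deterministic_policy, ["left", "right"]))
```

The last helper shows that a deterministic policy is a special case of a stochastic one, with all probability mass on a single action.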

Implementation

Implementation pitfalls

Markov property violation (severity: Critical)

If the state does not contain the full information needed to predict the future, the problem is not a valid MDP — algorithms may fail to converge to the optimal policy.

Fix: Extend the state representation (e.g. frame stacking), model the problem as a POMDP, or add memory (RNN, transformer) to the agent.
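Frame stacking can be sketched in a few lines: the agent's state becomes the last k observations rather than the latest one alone. The class below is an illustrative assumption, not the API of any particular RL library.

```python
from collections import deque


class FrameStack:
    """Keep the k most recent observations and expose them as one state."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        # Fill the buffer with copies of the first observation.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_obs)
        return tuple(self.frames)

    def step(self, obs):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(obs)
        return tuple(self.frames)


# A single position reading does not reveal velocity; two consecutive ones do.
stack = FrameStack(k=2)
print(stack.reset(0.0))  # (0.0, 0.0)
print(stack.step(1.0))   # (0.0, 1.0)
print(stack.step(3.0))   # (1.0, 3.0)
```

Whether a short window is enough to restore the Markov property depends on the task; when relevant information lies further back than k steps, a memory-based agent or a POMDP formulation is the more appropriate fix.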

Curse of dimensionality (severity: High)

Exponential growth of state space size with dimensionality makes exact solutions infeasible.

Fix: Value function approximation (Deep RL), state aggregation, hierarchical decomposition, factored MDPs.
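A quick calculation shows how fast a tabular representation grows, and a simple binning function illustrates state aggregation. Both helpers are hypothetical sketches; the grid sizes and bin width are arbitrary example values.

```python
def tabular_q_entries(values_per_dim, n_dims, n_actions):
    """Number of Q-table entries when each state dimension takes a fixed number of values."""
    return values_per_dim ** n_dims * n_actions


def aggregate(state, bin_width):
    """State aggregation: map a continuous state vector to a coarse grid cell."""
    return tuple(int(x // bin_width) for x in state)


print(tabular_q_entries(10, 3, 4))   # 4000
print(tabular_q_entries(10, 10, 4))  # 40000000000
print(aggregate((0.26, 1.9), 0.5))   # (0, 3)
```

Going from 3 to 10 state dimensions multiplies the table size by ten million, which is why exact tabular methods stop being practical and approximation or aggregation becomes necessary.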

Partial observability (severity: High)

In real-world tasks the agent rarely observes the full state. Naively applying an MDP where a POMDP is needed can lead to a suboptimal policy.

Fix: Model the problem as a POMDP, maintain belief states, or use memory-augmented agents (LSTM, transformer).
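A belief state is a probability distribution over the hidden states, updated by Bayes' rule after each action and observation: b'(s') ∝ O(o|s',a) · Σ_s P(s'|s,a) · b(s). The sketch below assumes a finite POMDP; the observation model `O[a][s2][o]` is an addition that is not part of the MDP tuple described here, and the table values and function name are hypothetical.

```python
def belief_update(belief, P, O, a, o):
    """Bayes filter step: predict with P(s2|s,a), then weight by O(o|s2,a)."""
    n = len(belief)
    unnormalised = [
        O[a][s2][o] * sum(P[s][a][s2] * belief[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalised]


# Hypothetical example: 2 hidden states, 1 action that leaves the state unchanged,
# and a noisy sensor that reports the true state with probability 0.85.
P = [[[1.0, 0.0]], [[0.0, 1.0]]]      # P[s][a][s2]
O = [[[0.85, 0.15], [0.15, 0.85]]]    # O[a][s2][o]

belief = [0.5, 0.5]
belief = belief_update(belief, P, O, a=0, o=0)
print(belief)
```

Starting from a uniform belief, a single observation of `o = 0` shifts the belief to roughly 0.85 for state 0 and 0.15 for state 1; the belief itself is Markov, so the agent can plan over beliefs instead of raw observations.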

Non-stationary rewards (severity: Medium)

A standard MDP assumes stationary P and R. When the environment changes, the optimal policy changes too, and handling this requires extensions such as non-stationary or contextual MDPs.

Fix: Model the problem as a contextual MDP, or use online learning, meta-learning, or continual policy adaptation.