At each step t the agent observes state s_t ∈ S, selects action a_t ∈ A according to policy π(a|s), the environment transitions to s_{t+1} ~ P(·|s_t, a_t) and returns reward r_t = R(s_t, a_t). Goal: find policy π* maximising the value function V^π(s) = E[Σ γ^t · r_t | s_0=s, π]. The optimal value function satisfies the Bellman equation: V*(s) = max_a [R(s,a) + γ Σ_{s'} P(s'|s,a) V*(s')]. MDPs are solved by: Value Iteration (iterative application of the Bellman operator), Policy Iteration (alternating policy evaluation and improvement), and linear programming. When P and R are unknown (model-free), RL algorithms (Q-learning, SARSA, policy gradients) are used, operating on sampled trajectories. The Markov property guarantees that the optimal policy is stationary and deterministic (for MDPs with discrete S and A).
How to mathematically formalise the problem of agent decision-making in a stochastic environment — in a way that allows proving the existence of an optimal policy and constructing algorithms to find it.
The set of all possible environment states. Can be discrete (finite or countable) or continuous (e.g. R^n).
The set of actions available to the agent. Can be discrete (e.g. {left, right, up, down}) or continuous (e.g. torque in robotics).
P(s'|s,a) — probability of transitioning to state s' after taking action a in state s. Defines the stochastic dynamics of the environment.
R(s,a) or R(s,a,s') — scalar reward returned by the environment. Defines the agent's objective — everything an MDP optimises is a sum of discounted rewards.
γ ∈ [0,1]. Weight of future rewards versus immediate ones. γ < 1 guarantees convergence of the reward series over an infinite horizon.
Official
π(a|s) — function mapping state to probability distribution over actions. The solution to an MDP is the optimal policy π*.
Official
If the state does not contain the full information needed to predict the future, the problem is not a valid MDP — algorithms may fail to converge to the optimal policy.
Exponential growth of state space size with dimensionality makes exact solutions infeasible.
In real-world tasks the agent rarely observes the full state — naively applying MDP instead of POMDP leads to suboptimal policy.
Standard MDP assumes stationary P and R. When the environment changes, the optimal policy changes too — requires extensions (non-stationary MDP, contextual MDP).
Andrey Markov defines stochastic processes with the memoryless property — mathematical precursor to MDP.
Richard Bellman introduces dynamic programming and the principle of optimality — foundation of methods for solving MDPs.
First formal definition of MDP — tuple (S, A, P, R) with value function and Bellman equation.
Ronald Howard publishes "Dynamic Programming and Markov Processes" — introduces Policy Iteration as alternative to Value Iteration.
Åström extends MDP to the case where the agent does not observe the full state — foundation for problems with incomplete information.
Chris Watkins proves convergence of Q-learning to the optimal policy in MDPs without knowing P and R — model-free RL.
First systematic analysis of MDPs with value function approximation — paving the way for Deep RL.
Canonical summary of the field — MDP becomes the universally used formalism in RL.
Time complexity: O(|S|² · |A|) per iteracja (Value Iteration). Space complexity: O(|S|² · |A|).
The state table grows exponentially with state dimensionality — exact solution becomes infeasible for |S| > ~10⁶. Hence the need for function approximation (Deep RL) or abstraction.
Number (or dimensionality) of states. Critically affects feasibility of exact algorithms — value iteration is O(|S|² · |A|) per iteration.
Number of actions per state (or dimensionality of continuous action space).
γ ∈ [0,1]. Defines planning horizon. Close to 1 → long horizon, slower algorithm convergence.
Number of decision steps — finite or infinite. Affects algorithm choice (finite-horizon DP vs. infinite-horizon iteration).
MDP is a formal model, not a computational architecture — execution depends on the chosen algorithm (Value Iteration, Policy Iteration, RL).
Value iteration updates are independent per state — max_a Q(s,a) computations for different s can be parallelised. Asynchronous DP (Sutton) allows non-uniform state update orders.
MDP is a formal mathematical model — has no hardware preference per se. Implementations of specific algorithms (VI, PI, Q-learning, Deep RL) have their own hardware preferences.