Architecture

Markov Chain

1906 · Active · Published: 30 May 2026 · Updated: 30 May 2026

Key innovation

Formalisation of stochastic processes with the memoryless property — the future state depends only on the current state, not on the full history. Foundation of the entire theory of Markov processes, MDPs and RL.

How it works

A Markov chain is defined by (1) a state space S, (2) an initial distribution μ₀, and (3) a transition matrix P (or a generator Q for a continuous-time chain, CTMC). The distribution at step n is μ_n = μ₀ · Pⁿ.

States are classified as recurrent or transient and as periodic or aperiodic. States that communicate with each other form equivalence classes (communicating classes).

Central theorems: (a) convergence theorem: an ergodic (irreducible, aperiodic, positive recurrent) chain converges to a unique stationary distribution π satisfying π = πP; (b) ergodic theorem: the time average of f(X_n) converges to the average of f under π, E_π[f].

Algorithms for computing π: solving the linear system π = πP together with the normalisation Σ_i π_i = 1, or power iteration (also called von Mises iteration), which repeatedly multiplies a distribution by P.

In algorithmic practice (MCMC), Markov chains are constructed to have a specified stationary distribution (e.g. Metropolis-Hastings, Gibbs sampling), which makes it possible to sample from a distribution that is difficult to sample from directly.
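As a minimal sketch of the evolution rule μ_n = μ₀ · Pⁿ, the NumPy snippet below propagates an initial distribution through a small chain. The 3-state matrix and the helper name `distribution_at` are assumptions made for illustration; they are not taken from any library.

```python
import numpy as np

# Hypothetical 3-state chain; the numbers are illustrative only.
P = np.array([
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
])
mu0 = np.array([1.0, 0.0, 0.0])  # start in state 0 with probability 1


def distribution_at(mu0, P, n):
    """Return mu_n = mu0 @ P^n (row-vector convention)."""
    return mu0 @ np.linalg.matrix_power(P, n)


mu_1 = distribution_at(mu0, P, 1)    # distribution after one step
mu_50 = distribution_at(mu0, P, 50)  # distribution after many steps
print(mu_1, mu_50)
```

This example chain is irreducible and aperiodic, so μ_50 agrees with its stationary distribution (0.25, 0.5, 0.25) up to floating-point precision.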

Problem solved

How to model and analyse stochastic systems evolving over time so that long-term properties (stationary distribution, return time, state classes) can be computed without tracking the full history.

Components

State space (S) · Representation of possible system configurations

The set of all possible chain states. It can be finite, countably infinite, or continuous.

Transition matrix (P) · Stochastic dynamics

P_ij = P(X_{n+1} = j | X_n = i). P is a row-stochastic matrix: its entries are non-negative and each row sums to 1. Together with μ₀ it defines the full dynamics of a time-homogeneous chain.
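The sketch below checks the row-stochastic property and draws a sample path by repeatedly sampling the next state from the row of the current state. The function names `is_row_stochastic` and `simulate`, the 2-state matrix, and the seed are hypothetical choices for illustration.

```python
import numpy as np


def is_row_stochastic(P, tol=1e-12):
    """True if P is square, non-negative, and every row sums to 1."""
    P = np.asarray(P, dtype=float)
    return (
        P.ndim == 2
        and P.shape[0] == P.shape[1]
        and bool(np.all(P >= 0.0))
        and bool(np.allclose(P.sum(axis=1), 1.0, atol=tol))
    )


def simulate(P, start, n_steps, rng):
    """Sample X_0, ..., X_n: the next state is drawn from row P[X_n]."""
    P = np.asarray(P, dtype=float)
    path = [start]
    for _ in range(n_steps):
        path.append(int(rng.choice(P.shape[0], p=P[path[-1]])))
    return path


# Illustrative 2-state chain (assumed values).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
rng = np.random.default_rng(0)
path = simulate(P, start=0, n_steps=10, rng=rng)
print(is_row_stochastic(P), path)
```

Validating P before simulating is a cheap guard, since sampling from a row that does not sum to 1 raises an error in NumPy.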

Initial distribution (μ₀) · Process starting point

Probability distribution of the state at time 0. Often chosen as a deterministic distribution (X₀ = s₀ with probability 1).


Stationary distribution (π) · Long-term chain behaviour

A distribution π satisfying π = πP. For an ergodic chain, π is unique and is the limit of μ_n as n → ∞, regardless of μ₀.
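One way to compute π for a small finite chain is to solve π = πP together with the normalisation Σ_i π_i = 1 as a least-squares problem. The sketch below assumes the chain has a unique stationary distribution; the function name `stationary_distribution` and the example matrix are illustrative.

```python
import numpy as np


def stationary_distribution(P):
    """Solve pi (P - I) = 0 with sum(pi) = 1 via least squares.

    Assumes a unique stationary distribution exists
    (for example, an irreducible finite chain).
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    # Stack the transposed balance equations with the normalisation row.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi


# Illustrative 2-state chain (assumed values).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = stationary_distribution(P)
print(pi)
```

For this matrix the balance condition π₀ · 0.1 = π₁ · 0.4 gives π = (0.8, 0.2), which the solver reproduces up to rounding.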

Implementation

Reference implementations

NumPy / SciPy linalg (Markov chain analysis) · Python · Official

Implementation pitfalls

Markov property violation · Critical

Modelling a system as a Markov chain when the state does not contain enough information to predict the future — leads to incorrect conclusions about stationary distribution and transition times.

Fix: Extend the state representation (state augmentation), use higher-order chains (as in n-gram models), or use an HMM with sufficient hidden states.
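State augmentation can be sketched for a second-order chain, where the next state depends on the last two states. Treating each pair (X_{n-1}, X_n) as a single state yields an ordinary first-order chain. The array layout `T[i, j, k] = P(X_{n+1} = k | X_{n-1} = i, X_n = j)`, the function name, and the numbers are assumptions for illustration.

```python
import numpy as np


def augment_second_order(T):
    """Build a first-order transition matrix on pair states.

    Pair (i, j) is encoded as index i * m + j. The only allowed moves
    are (i, j) -> (j, k), taken with probability T[i, j, k].
    """
    T = np.asarray(T, dtype=float)
    m = T.shape[0]
    P = np.zeros((m * m, m * m))
    for i in range(m):
        for j in range(m):
            for k in range(m):
                P[i * m + j, j * m + k] = T[i, j, k]
    return P


# Hypothetical second-order dynamics on two states.
T = np.array([
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.3, 0.7], [0.2, 0.8]],
])
P_pairs = augment_second_order(T)
print(P_pairs)
```

The augmented matrix has m² states, so this trade of memory for the Markov property is only practical for short histories or small m.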

MCMC convergence issues (mixing time) · High

A chain may take very long to converge to the stationary distribution (slow mixing) — especially in high-dimensional spaces with narrow corridors.

Fix: Convergence diagnostics (R-hat, ESS), parallel tempering, Hamiltonian Monte Carlo, NUTS, model reparametrisation.
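As a rough illustration of the R-hat diagnostic, the sketch below implements the classic Gelman-Rubin statistic for several chains of a scalar quantity. It is a simplified version, not the rank-normalised split R-hat used by current MCMC libraries, and the synthetic "chains" are independent normal draws chosen only to show the two regimes.

```python
import numpy as np


def gelman_rubin_rhat(chains):
    """Classic (non-split) R-hat for an array of shape (m chains, n draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)        # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()  # mean within-chain variance
    var_plus = (n - 1) / n * W + B / n     # pooled variance estimate
    return float(np.sqrt(var_plus / W))


rng = np.random.default_rng(1)
# Synthetic stand-ins for sampler output (assumption: not real MCMC draws).
mixed = rng.normal(0.0, 1.0, size=(4, 2000))
# Two chains shifted by 3 mimic chains stuck in a different mode.
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])

rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
print(rhat_mixed, rhat_stuck)
```

Values close to 1 are consistent with chains exploring the same distribution, while values well above 1 signal that the chains disagree. A value near 1 is necessary for convergence but does not prove it.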

Non-stationarity / non-ergodicity · High

A reducible chain may have more than one stationary distribution, and a periodic chain need not converge to its stationary distribution. In either case the classical convergence theorems do not apply.

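Fix: Check irreducibility and aperiodicity and analyse the communicating classes. Periodicity can be removed with a "lazy chain" (P' = (P+I)/2), which keeps the same stationary distributions; reducibility is not fixed by this and requires analysing each closed communicating class separately.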
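These checks can be sketched for a small finite chain. The code below tests irreducibility through graph reachability, computes the period of an irreducible chain from breadth-first-search levels, and builds the lazy chain. All function names are hypothetical, and the dense reachability computation is only suitable for small state spaces.

```python
from collections import deque
from math import gcd

import numpy as np


def is_irreducible(P):
    """True if every state can reach every other state."""
    A = np.asarray(P) > 0
    n = A.shape[0]
    reach = A | np.eye(n, dtype=bool)
    for _ in range(n):  # repeated squaring covers all path lengths
        reach = (reach.astype(int) @ reach.astype(int)) > 0
    return bool(reach.all())


def period(P):
    """Period of an irreducible chain: gcd of its cycle lengths."""
    A = np.asarray(P) > 0
    dist = [-1] * A.shape[0]
    dist[0] = 0
    queue = deque([0])
    g = 0
    while queue:
        u = queue.popleft()
        for v in map(int, np.flatnonzero(A[u])):
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
            # Each edge contributes its level offset to the gcd.
            g = gcd(g, dist[u] + 1 - dist[v])
    return g


def lazy(P):
    """Lazy chain P' = (P + I) / 2: stay put with probability 1/2."""
    P = np.asarray(P, dtype=float)
    return 0.5 * (P + np.eye(P.shape[0]))


flip = np.array([[0.0, 1.0], [1.0, 0.0]])       # deterministic alternation
absorbing = np.array([[1.0, 0.0], [0.5, 0.5]])  # state 0 is absorbing
print(is_irreducible(flip), period(flip), period(lazy(flip)))
print(is_irreducible(absorbing))
```

The alternating chain is irreducible with period 2, and its lazy version is aperiodic. Since πP' = (πP + π)/2 = π, the lazy chain has the same stationary distribution as the original.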

Numerical scale of matrix P · Medium

For very large |S| (e.g. word-level language models) the matrix P cannot be stored explicitly as a dense array. Naive power iteration can also lose numerical precision, for example through underflow of very small probabilities.

Fix: Sparse representations, low-rank approximations, sampling instead of the full matrix, log-space arithmetic.
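A sparse representation can be sketched with `scipy.sparse`: only the non-zero transitions are stored, and power iteration needs nothing more than sparse matrix-vector products. The birth-death chain, the function names, and the tolerance below are assumptions chosen so that the result can be checked against a known closed form.

```python
import numpy as np
from scipy import sparse


def birth_death_chain(n, up=0.3, down=0.5):
    """Sparse tridiagonal chain: move up, move down, or stay."""
    stay = np.full(n, 1.0 - up - down)
    stay[0] = 1.0 - up      # cannot move down from state 0
    stay[-1] = 1.0 - down   # cannot move up from state n - 1
    return sparse.diags(
        [np.full(n - 1, down), stay, np.full(n - 1, up)],
        offsets=[-1, 0, 1],
        format="csr",
    )


def sparse_power_iteration(P, tol=1e-12, max_iter=10_000):
    """Iterate pi <- pi P using sparse products until the L1 change < tol."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    PT = P.T.tocsr()  # pi P as a row vector equals P^T pi as a column vector
    for _ in range(max_iter):
        nxt = PT @ pi
        nxt /= nxt.sum()  # renormalise to limit rounding drift
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi


P = birth_death_chain(100)
pi = sparse_power_iteration(P)

# Closed form from detailed balance: pi_i proportional to (up / down) ** i.
expected = 0.6 ** np.arange(100)
expected /= expected.sum()
print(np.abs(pi - expected).max())
```

The tridiagonal matrix stores about 3n values instead of n², which is the point of the sparse layout. The same loop works for any sparse row-stochastic matrix, although the number of iterations depends on how quickly the chain mixes.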

Evolution

Original paper · 1906 · Bulletin of the Society of Physics and Mathematics of Kazan University · Andrey Markov

Extension of the law of large numbers to dependent quantities

Andrey Markov

1906

Markov defines chains of dependent variables

Inflection point

Andrey Markov extends the law of large numbers to dependent random variables — first formal definition of a Markov chain.

1913

Markov applies chains to "Eugene Onegin"

First application of chains to natural text — statistical analysis of vowel/consonant sequences in Pushkin's poem. Precursor of n-gram language models.

1953

Metropolis algorithm

Inflection point

Metropolis et al. publish the first MCMC algorithm — using a Markov chain to sample from the Boltzmann distribution in statistical physics.

Equation of State Calculations by Fast Computing Machines (paper)

1957

Markov Decision Process (Bellman)

Inflection point

Bellman extends Markov chains with actions and rewards — defines MDP, the foundation of Reinforcement Learning.

MDP (concept)

1970

Hastings generalises the Metropolis algorithm