A Markov chain is defined by: (1) state space S, (2) initial distribution μ₀, (3) transition matrix P (or generator Q for CTMC). Evolution: the distribution at step n is μ_n = μ₀ · Pⁿ. States are classified as: recurrent vs. transient, periodic vs. aperiodic, communicating (same equivalence classes). Central theorems: (a) convergence theorem — an ergodic chain converges to a unique stationary distribution π = πP, (b) ergodic theorem — the time average of f(X_n) converges to the spatial average E_π[f]. Algorithms: computing π by solving a linear equation, matrix power iteration, von Mises iteration. In algorithmic practice (MCMC), Markov chains are constructed with a specified stationary distribution (e.g. Metropolis-Hastings, Gibbs sampling) — sampling from a distribution that is difficult to sample directly.
How to model and analyse stochastic systems evolving over time so that long-term properties (stationary distribution, return time, state classes) can be computed without tracking the full history.
The set of all possible chain states. Can be finite, countable (DTMC) or continuous.
P_ij = P(X_{n+1}=j | X_n=i). Stochastic matrix (rows sum to 1). Defines the full dynamics of the chain.
Probability distribution of the state at time 0. Often chosen as a deterministic distribution (X₀ = s₀ with probability 1).
Official
Distribution π satisfying π = πP. For an ergodic chain: unique and is the limit of distribution μ_n regardless of μ₀.
Modelling a system as a Markov chain when the state does not contain enough information to predict the future — leads to incorrect conclusions about stationary distribution and transition times.
A chain may take very long to converge to the stationary distribution (slow mixing) — especially in high-dimensional spaces with narrow corridors.
A chain may lack a unique stationary distribution (reducibility, periodicity) — then classical convergence theorems do not apply.
For very large |S| (e.g. word-level language models) matrix P cannot be explicitly stored. Naive power iteration loses numerical precision.
Andrey Markov extends the law of large numbers to dependent random variables — first formal definition of a Markov chain.
First application of chains to natural text — statistical analysis of vowel/consonant sequences in Pushkin's poem. Precursor of n-gram language models.
Metropolis et al. publish the first MCMC algorithm — using a Markov chain to sample from the Boltzmann distribution in statistical physics.
Bellman extends Markov chains with actions and rewards — defines MDP, the foundation of Reinforcement Learning.
W. K. Hastings publishes a generalisation — Metropolis-Hastings becomes the principal MCMC algorithm in Bayesian statistics.
Lawrence Rabiner publishes a foundational HMM tutorial — Markov chains become the dominant speech recognition model until the deep learning era.
PageRank defines page ranking as the stationary distribution of a "random surfer" Markov chain — foundation of the search engine revolution.
Sohl-Dickstein et al. use a forward Markov chain to gradually add noise to data — foundation of modern diffusion models (DDPM, Stable Diffusion).
Time complexity: O(|S|³) — obliczenie π przez rozwiązanie układu liniowego. Space complexity: O(|S|²).
Number or dimensionality of states. Affects size of matrix P (|S|×|S|) and stationary computation cost.
Discrete (DTMC, step sequence) vs. continuous (CTMC, generator Q and exponential probabilities).
Number of previous states influencing the next state distribution. Standard chain is order 1; n-grams in NLP are higher-order chains.
A Markov chain is a formal mathematical model — execution depends on the specific algorithm (forward simulation, Metropolis-Hastings, Gibbs sampling, eigen-decomposition).
A single chain is inherently sequential. Multiple independent chains can be parallelised (parallel tempering, multiple-chain MCMC). Matrix operations (matrix power of P) are fully parallelisable.
Sequential simulation of a Markov chain is typically CPU-bound. AVX vectorisation accelerates sampling from discrete distributions.
Matrix operations (matrix power of P, eigen-decomposition) and parallel sampling of multiple chains (parallel tempering, batch MCMC) are well accelerated on GPU.
A Markov chain is a formal mathematical model — has no hardware preference per se. Hardware choice depends on the specific implementation.