Each agent i receives a (possibly partial) observation o_i of the state s and selects an action a_i ~ π_i(·|o_i); the environment then evolves according to P(s'|s, a₁, …, a_N) and returns rewards r_i = R_i(s, a₁, …, a_N). Agent objectives can be aligned (cooperation), opposed (competition), or mixed. Training mostly follows the CTDE paradigm (centralised training, decentralised execution): during training a centralised critic sees the global state and all agents' actions, while the policies π_i remain local. Competitive MARL relies on self-play, in which an agent plays against its own earlier versions (the foundation of AlphaZero and AlphaStar). Main algorithm families: (a) Independent Learning (e.g. IQL): each agent treats the others as part of the environment; simple but unstable, (b) Value Decomposition (VDN, QMIX): the joint Q-function is decomposed into a sum (VDN) or a monotonic mixing (QMIX) of per-agent Q_i, (c) Actor-Critic with a centralised critic (MADDPG, MAPPO, COMA), (d) Communication-based: agents learn communication protocols, (e) Mean-field MARL: an approximation that scales to very large N. Key game-theoretic concepts: Nash equilibrium, Pareto-optimality, social welfare, correlated equilibrium.
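The value-decomposition idea can be illustrated with a minimal sketch of the additive (VDN-style) case. The per-agent Q-values below are made-up numbers and the function names are hypothetical; the point is only that, when the joint value is a sum of per-agent values, each agent maximising its own Q_i recovers the joint greedy action.

```python
from itertools import product

# Hypothetical per-agent utilities Q_i(o_i, a_i): 3 agents, 2 actions each.
q_values = [
    [1.0, 3.0],
    [2.0, 0.5],
    [0.0, 4.0],
]


def vdn_joint_q(per_agent_q, joint_action):
    """VDN-style mixing: Q_tot is the plain sum of the chosen per-agent values."""
    return sum(q[a] for q, a in zip(per_agent_q, joint_action))


def decentralised_greedy(per_agent_q):
    """Each agent maximises its own Q_i independently."""
    return tuple(max(range(len(q)), key=lambda a, q=q: q[a]) for q in per_agent_q)


def centralised_greedy(per_agent_q):
    """Brute-force argmax of Q_tot over every joint action."""
    joint_actions = product(*(range(len(q)) for q in per_agent_q))
    return max(joint_actions, key=lambda ja: vdn_joint_q(per_agent_q, ja))


local_choice = decentralised_greedy(q_values)
joint_choice = centralised_greedy(q_values)
print(local_choice, joint_choice, vdn_joint_q(q_values, joint_choice))
```

QMIX relaxes the sum to any mixing function that is monotonic in each Q_i, which preserves the same agreement between local and joint argmax; that mixing network is not shown here.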
How to teach a group of agents to cooperate (or compete) effectively in conditions where the environment is non-stationary from each agent's perspective because the other agents are also learning and changing their behaviour.
Formal mathematical structure of MARL: tuple (N, S, {A_i}, P, {R_i}, γ). Extends MDP to multiple agents with individual action spaces and rewards.
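As a rough illustration, the tuple can be written down as a small data structure. This is a sketch only: the class name `MarkovGame`, its field names, and the single-state coordination game are assumptions made for the example, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

State = int
JointAction = Tuple[int, ...]


@dataclass(frozen=True)
class MarkovGame:
    """Container for the tuple (N, S, {A_i}, P, {R_i}, gamma)."""
    n_agents: int                                                    # N
    states: Sequence[State]                                          # S
    action_spaces: Sequence[Sequence[int]]                           # {A_i}
    transition: Callable[[State, JointAction], Dict[State, float]]   # P(s'|s, a)
    rewards: Sequence[Callable[[State, JointAction], float]]         # {R_i}
    gamma: float                                                     # discount


# Hypothetical example: a single-state, two-agent coordination game.
def stay(state, joint_action):
    return {0: 1.0}  # the only state always transitions to itself


def team_reward(state, joint_action):
    # Both agents are rewarded when they pick the same action.
    return 1.0 if joint_action[0] == joint_action[1] else 0.0


game = MarkovGame(
    n_agents=2,
    states=[0],
    action_spaces=[[0, 1], [0, 1]],
    transition=stay,
    rewards=[team_reward, team_reward],  # identical rewards: fully cooperative
    gamma=0.95,
)

joint_action = (1, 1)
per_agent_rewards = [reward(0, joint_action) for reward in game.rewards]
next_state_dist = game.transition(0, joint_action)
print(per_agent_rewards, next_state_dist)
```

Giving each agent its own entry in `rewards` is what separates this from a single-agent MDP: replacing one of the two functions with its negation would turn the same structure into a zero-sum game.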
Individual policies π_i(a_i|o_i) of each agent. Under CTDE they are executed in a decentralised manner, using only local observations.
Global value function used only during training (CTDE). Has access to joint state and actions of all agents, which stabilises learning.
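The difference in information flow between a local policy and a centralised critic can be sketched as follows. The helper names and the toy vectors are hypothetical, and real implementations would feed these inputs to neural networks rather than return plain lists.

```python
def actor_input(observations, agent_id):
    """Decentralised actor: sees only its own local observation."""
    return list(observations[agent_id])


def critic_input(global_state, joint_action):
    """Centralised critic: sees the global state plus every agent's action."""
    return list(global_state) + list(joint_action)


# Hypothetical data for three agents with 2-dimensional local observations.
observations = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
global_state = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.9]  # may hold extra hidden info
joint_action = [1, 0, 2]

actor_features = actor_input(observations, agent_id=1)
critic_features = critic_input(global_state, joint_action)
print(len(actor_features), len(critic_features))
```

Only `actor_input` is needed at execution time, which is why the critic can be discarded after training.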
A solution concept from game theory, such as Nash equilibrium, correlated equilibrium, or Pareto optimality, that defines what counts as the "solution" of a multi-agent game and therefore what learning should converge to.
Training mechanism in competitive games — the agent plays against current and earlier versions of itself. Generates a natural difficulty curriculum.
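A minimal sketch of the opponent-sampling side of self-play is shown below. The `OpponentPool` class, its `latest_prob` parameter, and the dictionary snapshots are illustrative assumptions; league training as used in AlphaStar is considerably more elaborate.

```python
import random


class OpponentPool:
    """Stores past policy snapshots and samples opponents from them."""

    def __init__(self, latest_prob=0.5, seed=0):
        self.snapshots = []
        self.latest_prob = latest_prob  # chance of facing the newest snapshot
        self.rng = random.Random(seed)

    def add(self, policy_params):
        # Shallow copy; assumes a flat parameter dict for this illustration.
        self.snapshots.append(dict(policy_params))

    def sample(self):
        if not self.snapshots:
            raise ValueError("opponent pool is empty")
        if len(self.snapshots) == 1 or self.rng.random() < self.latest_prob:
            return self.snapshots[-1]
        # Otherwise pick uniformly among the earlier versions.
        return self.rng.choice(self.snapshots[:-1])


pool = OpponentPool(latest_prob=0.5, seed=0)
for version in range(3):
    pool.add({"version": version})  # stand-in for real network weights

draws = [pool.sample()["version"] for _ in range(1000)]
latest_share = draws.count(2) / len(draws)
print(sorted(set(draws)), latest_share)
```

Mixing in earlier versions is one common way to keep the agent from forgetting how to beat strategies it has already overcome.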
From a single agent's perspective the environment is non-stationary because the other agents learn and change their policies. Naively applying independent Q-learning violates the stationarity assumption that single-agent convergence guarantees rely on.
In cooperative MARL it is hard to determine which agent contributed to the global reward. Naive methods that train every agent on the shared team reward tend to produce lazy (free-rider) behaviour.
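One way to make the credit-assignment problem concrete is a counterfactual "difference reward": compare the team reward with what it would have been had one agent taken a default action. This technique is not described in the text above and is used here only as an illustration; it is related in spirit to the counterfactual baseline in COMA. The task and function names are hypothetical.

```python
def team_reward(actions):
    """Hypothetical task: the team earns 1 per agent that works (action 1)."""
    return sum(actions)


def difference_reward(actions, agent, default_action=0):
    """Team reward minus the reward had `agent` taken the default action."""
    counterfactual = list(actions)
    counterfactual[agent] = default_action
    return team_reward(actions) - team_reward(counterfactual)


actions = [1, 1, 0]  # agents 0 and 1 work, agent 2 idles

shared = [team_reward(actions) for _ in actions]
credited = [difference_reward(actions, i) for i in range(len(actions))]
print(shared, credited)
```

Under the shared reward the idle agent receives the same signal as the two working agents, whereas the difference reward assigns it zero.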
Joint action space grows exponentially with N: |A|^N. For 10 agents with 10 actions each, that is already 10¹⁰ joint actions — not directly tractable.
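The growth can be checked with a few lines of arithmetic. The helper name below is hypothetical, and it assumes every agent has the same number of actions.

```python
from itertools import product


def joint_action_count(actions_per_agent, n_agents):
    """Size of the joint action space when all agents have equally many actions."""
    return actions_per_agent ** n_agents


# Explicit enumeration is only feasible for tiny cases: 3 actions, 2 agents.
small_joint_space = list(product(range(3), repeat=2))

growth = {n: joint_action_count(10, n) for n in (1, 2, 5, 10)}
print(len(small_joint_space), growth)
```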
In general-sum games poorly chosen rewards can lead to dominance of one agent, social dilemmas (Tragedy of the Commons), or reward hacking.
In general-sum games there is no guarantee that gradient-based learning converges to a Nash equilibrium — cycles, drift, and exploitation loops are possible.
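The failure mode can be reproduced on a very small example. The sketch below uses the bilinear game f(x, y) = x * y, a zero-sum special case chosen purely for illustration, where one player minimises over x and the other maximises over y. Its unique equilibrium is (0, 0), yet simultaneous gradient steps spiral away from it. The step size and iteration count are arbitrary.

```python
import math


def simultaneous_gradient_step(x, y, lr):
    """Player 1 minimises f(x, y) = x * y over x; player 2 maximises it over y."""
    grad_x = y  # df/dx
    grad_y = x  # df/dy
    return x - lr * grad_x, y + lr * grad_y


x, y = 1.0, 0.0
distances = [math.hypot(x, y)]  # distance from the equilibrium (0, 0)
for _ in range(100):
    x, y = simultaneous_gradient_step(x, y, lr=0.1)
    distances.append(math.hypot(x, y))

print(distances[0], distances[-1])
```

Each step multiplies the squared distance from the equilibrium by (1 + lr²), so the iterates rotate around (0, 0) while drifting outwards rather than converging.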
Most Deep MARL algorithms are designed for N ≤ 10–20. Scaling to N > 100 requires approximations (mean-field, graph neural networks) and aggressive parameter sharing.
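The mean-field approximation mentioned above can be sketched in a few lines: instead of conditioning on every neighbour's action, an agent conditions on the average of their one-hot actions, so the input size no longer depends on the number of neighbours. The function name and data are hypothetical, and the Q-function that would consume this vector is omitted.

```python
def mean_action(neighbour_actions, n_actions):
    """Average of the neighbours' one-hot encoded discrete actions."""
    if not neighbour_actions:
        raise ValueError("need at least one neighbour")
    counts = [0] * n_actions
    for action in neighbour_actions:
        counts[action] += 1
    return [count / len(neighbour_actions) for count in counts]


few = mean_action([0, 2, 2, 1], n_actions=3)
many = mean_action([0, 2, 2, 1] * 250, n_actions=3)  # 1000 neighbours
print(few, many)
```

Both calls return a vector of length 3, regardless of whether the agent has 4 or 1000 neighbours.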
John von Neumann proves the minimax theorem for two-player zero-sum games — foundation of game theory and competitive MARL.
John Nash defines the concept of equilibrium in non-cooperative games — a key learning goal concept in MARL.
Michael Littman formally defines MARL as Markov Games and introduces the minimax-Q algorithm for zero-sum games.
Hu and Wellman introduce Nash-Q, a generalisation of Q-learning to general-sum games whose updates target Nash equilibrium values.
Lowe et al. introduce Multi-Agent DDPG (MADDPG), which uses centralised critics under the CTDE paradigm; it is one of the first widely adopted Deep MARL algorithms.
OpenAI presents a team of 5 PPO agents that defeats professional Dota 2 players, a breakthrough in scaling Deep MARL.
Rashid et al. introduce QMIX with monotonic Q-function decomposition, which becomes a standard baseline for cooperative Deep MARL.
DeepMind reaches Grandmaster level in StarCraft II — population-based self-play (league training) in competitive MARL with partial observability and a huge action space.
Yu et al. show that PPO with minor modifications achieves competitive results in MARL — strong baseline for SMAC/MPE benchmarks.
Meta AI combines MARL, language models, and strategic planning — Cicero reaches top-player level in Diplomacy, a game requiring natural-language negotiation.
Number of learning agents. Scale fundamentally affects algorithm choice: very large populations (N > 1000) typically call for mean-field or population-based methods.
Cooperative (shared), competitive (zero-sum) or mixed (general-sum). The most important taxonomic axis of MARL.
CTDE (centralised training, decentralised execution), fully centralised, fully decentralised. Determines architecture structure and information flow.
No communication, discrete messages, continuous vectors. Affects agent coordination capability.
Whether all homogeneous agents share network weights. Sharing reduces parameter count and speeds up training but limits policy heterogeneity.
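A minimal sketch of parameter sharing is given below. It assumes a single linear policy whose weights are used by every agent, with a one-hot agent ID appended to the observation so that agents can still behave differently. Appending the ID is a common convention rather than something stated above, and the weights are made-up numbers.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def shared_policy_logits(weights, observation, agent_id, n_agents):
    """One weight matrix for all agents; the agent ID is part of the input."""
    features = list(observation) + one_hot(agent_id, n_agents)
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


# Hypothetical weights: 2 actions x (2 observation dims + 2 agent-ID dims).
weights = [
    [1.0, 0.0, 0.5, -0.5],
    [0.0, 1.0, -0.5, 0.5],
]
observation = [1.0, 1.0]  # both agents see the same local observation

logits_agent0 = shared_policy_logits(weights, observation, agent_id=0, n_agents=2)
logits_agent1 = shared_policy_logits(weights, observation, agent_id=1, n_agents=2)
print(logits_agent0, logits_agent1)
```

The parameter count stays fixed as agents are added (apart from the ID dimensions), but all behavioural differences must be expressed through the shared weights, which is the heterogeneity limit noted above.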
Whether the agent explicitly models other agents' policies. Helps in non-stationary environments but adds complexity.
Each agent executes its policy conditioned on its local observation. Under CTDE, the critic conditions on the global state, and it does so only during training.
MARL does not use routing in the MoE sense; "routing" only appears in the context of communication between agents (communication channels).
Self-play and population-based training (PBT, AlphaStar league) are highly parallelisable — many parallel environment instances, opponent agents, replay buffers. Gradient updates remain synchronised within a single learner.
Deep MARL uses deep neural networks for policies and critics; GPUs are well suited to the matrix multiplications involved and to evaluating multiple agents in parallel.
Multi-agent environment simulation (PettingZoo, SMAC, MPE) is typically CPU-bound. Population-based self-play typically runs hundreds of parallel CPU actors feeding a GPU learner.
DeepMind used TPUs for AlphaStar and for population-based training with large batch sizes, at high parallel scale.