Training

MARL (Multi-Agent Reinforcement Learning)

1994 · Active · Published: 30 May 2026 · Updated: 30 May 2026

Key innovation

Extension of Reinforcement Learning to environments with multiple simultaneous agents, where the optimal policy of a single agent depends on the evolving policies of others — bringing game theory and equilibrium concepts into reward-based learning.

How it works

Each agent i receives a (potentially partial) observation o_i and selects an action a_i ~ π_i(·|o_i); the environment then evolves according to P(s'|s, a₁, …, a_N) and returns rewards r_i = R_i(s, a₁, …, a_N). Agent objectives can be aligned (cooperation), opposed (competition), or mixed.

Training is mostly done in the CTDE paradigm (centralised training, decentralised execution): during training a centralised critic sees the global state and all actions, while the policies π_i remain local. In competitive MARL, self-play is used: the agent plays against its own earlier versions, which is the foundation of AlphaZero and AlphaStar.

Main algorithm families:
(a) Independent Learning (IQL): each agent treats the others as part of the environment; simple but unstable.
(b) Value Decomposition (VDN, QMIX): the joint Q-function is factored into per-agent Q_i, as a sum in VDN and through a monotonic mixing function in QMIX.
(c) Actor-Critic with a centralised critic (MADDPG, MAPPO, COMA).
(d) Communication-based: agents learn communication protocols.
(e) Mean-field MARL: an approximation that scales to very large N.

Key game-theoretic concepts: Nash equilibrium, Pareto optimality, social welfare, correlated equilibrium.
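The observe, act, transition, reward loop described at the start of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the two-agent toy environment, the state encoding, the fixed policy rule, and all function names are hypothetical and chosen only to make the roles of o_i, π_i, P and R_i concrete.

```python
import random

N_AGENTS = 2
ACTIONS = [0, 1]  # hypothetical action set shared by both agents


def observe(state, i):
    """Partial observation o_i: agent i sees only its own coordinate of s."""
    return state[i]


def policy(i, obs, rng):
    """Local stochastic policy pi_i(.|o_i): a fixed toy rule with 10% random actions."""
    if rng.random() < 0.1:
        return rng.choice(ACTIONS)
    return obs % 2


def transition(state, joint_action, rng):
    """P(s'|s, a_1..a_N): the next state depends on ALL agents' actions."""
    total = sum(joint_action)
    return tuple((s + total + rng.randint(0, 1)) % 4 for s in state)


def rewards(state, joint_action):
    """R_i(s, a_1..a_N): cooperative toy reward, +1 to every agent if actions match."""
    r = 1.0 if len(set(joint_action)) == 1 else 0.0
    return [r] * N_AGENTS


def rollout(steps=5, seed=0):
    """Run decentralised execution: each policy only ever sees its own o_i."""
    rng = random.Random(seed)
    state = (0, 1)
    trajectory = []
    for _ in range(steps):
        obs = [observe(state, i) for i in range(N_AGENTS)]
        joint_action = tuple(policy(i, obs[i], rng) for i in range(N_AGENTS))
        r = rewards(state, joint_action)
        trajectory.append((state, joint_action, r))
        state = transition(state, joint_action, rng)
    return trajectory
```

Under CTDE, a training procedure would additionally log the full `state` and `joint_action` for a centralised critic, while `policy` would keep receiving only `obs[i]`.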

Problem solved

How to teach a group of agents to cooperate (or compete) effectively in conditions where the environment is non-stationary from each agent's perspective because the other agents are also learning and changing their behaviour.

Components

Stochastic Game / Markov Game · Formal model of multi-agent environment

Formal mathematical structure of MARL: tuple (N, S, {A_i}, P, {R_i}, γ). Extends MDP to multiple agents with individual action spaces and rewards.
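Given this tuple, the quantity each agent tries to maximise can be written out explicitly. The notation J_i and π_{-i} (the policies of all agents other than i) is an assumption introduced here for illustration; the observation o_j is used as in the rest of this page.

```latex
J_i(\pi_i, \pi_{-i}) =
  \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\,
    R_i\bigl(s_t, a_{1,t}, \dots, a_{N,t}\bigr) \right],
\qquad
a_{j,t} \sim \pi_j(\cdot \mid o_{j,t}),
\quad
s_{t+1} \sim P(\cdot \mid s_t, a_{1,t}, \dots, a_{N,t})
```

For N = 1 this reduces to the usual discounted return of an MDP. For N > 1 the value of π_i depends on π_{-i}, which is why a single agent's optimal policy is not defined in isolation.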

Dec-POMDP · Decentralised partially observable MDP: cooperative MARL with partial observability.

Zero-sum game · Pure competition: the agents' rewards sum to 0.

General-sum game · Mixed regime with both cooperative and competitive elements.

Agent policies (π_i) · Individual agent decisions

Individual policies π_i(a_i|o_i) of each agent. Under CTDE they are executed decentrally on local observations.

Centralised critic / value function · Policy gradient stabilisation

Global value function used only during training (CTDE). Has access to joint state and actions of all agents, which stabilises learning.

Equilibrium concept · Game solution definition

A solution concept from game theory (Nash equilibrium, correlated equilibrium, Pareto optimality) that defines what counts as the "solution" of a multi-agent game and therefore what a stable outcome of learning looks like.
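To make the Nash condition concrete: a joint action is a pure-strategy Nash equilibrium if no agent can improve its own payoff by deviating alone. The sketch below checks this by enumeration on a small matrix game. The function names and the Prisoner's Dilemma payoff values are illustrative assumptions, and the check covers pure strategies only, not mixed equilibria.

```python
from itertools import product


def is_pure_nash(payoffs, profile):
    """Return True if no single agent gains by deviating unilaterally.

    payoffs maps a joint-action tuple to a tuple of per-agent payoffs.
    """
    n_agents = len(profile)
    action_sets = [sorted({a[i] for a in payoffs}) for i in range(n_agents)]
    for i in range(n_agents):
        current = payoffs[profile][i]
        for alt in action_sets[i]:
            deviated = profile[:i] + (alt,) + profile[i + 1:]
            if payoffs[deviated][i] > current:
                return False
    return True


def pure_nash_equilibria(payoffs):
    """Enumerate all joint actions and keep those that pass the Nash check."""
    n_agents = len(next(iter(payoffs)))
    action_sets = [sorted({a[i] for a in payoffs}) for i in range(n_agents)]
    return [p for p in product(*action_sets) if is_pure_nash(payoffs, p)]


# Prisoner's Dilemma with illustrative payoffs: "C" = cooperate, "D" = defect.
PD = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

print(pure_nash_equilibria(PD))  # [('D', 'D')]
```

In this game the only Nash equilibrium, (D, D), is Pareto-dominated by (C, C), which shows why the equilibrium concepts listed above can disagree about what a good outcome is.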

Self-play loop · Curriculum for competitive training

Training mechanism in competitive games — the agent plays against current and earlier versions of itself. Generates a natural difficulty curriculum.
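A self-play loop with a pool of earlier versions can be sketched on rock-paper-scissors. This is a toy under stated assumptions: the "policy" is just a mixed strategy over three actions, the "learning step" is a small move toward the best response against the sampled opponent, and the names `self_play`, `pool` and `snapshot_every` are hypothetical. Real systems replace the update with a reinforcement-learning step.

```python
import random

# Rock-paper-scissors payoff for the row player: PAYOFF[a][b],
# with 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]


def best_response(opponent):
    """Pure action with the highest expected payoff against a mixed strategy."""
    values = [sum(PAYOFF[a][b] * opponent[b] for b in range(3)) for a in range(3)]
    return max(range(3), key=lambda a: values[a])


def self_play(iterations=200, snapshot_every=10, lr=0.1, seed=0):
    rng = random.Random(seed)
    policy = [1.0, 0.0, 0.0]      # start by always playing rock
    pool = [list(policy)]         # frozen earlier versions of the agent
    for t in range(1, iterations + 1):
        # Opponent is either the current policy or an earlier snapshot.
        opponent = rng.choice(pool + [policy])
        br = best_response(opponent)
        # Move the current policy a small step toward the best response.
        policy = [(1 - lr) * p + lr * (1.0 if a == br else 0.0)
                  for a, p in enumerate(policy)]
        if t % snapshot_every == 0:
            pool.append(list(policy))
    return policy, pool
```

Sampling from the pool rather than only from the latest version is one common way to keep the agent from forgetting how to beat strategies it has already moved away from.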

Implementation

Reference implementations

StarCraft Multi-Agent Challenge (SMAC)

Python

RLlib (Ray) — multi-agent API

Python

Implementation pitfalls

Environment non-stationarity · Critical

From a single agent's perspective the environment is non-stationary, because the other agents learn and change their policies. Naively applying Independent Q-learning therefore violates the stationarity assumption on which single-agent Q-learning's convergence guarantees rest.

Fix: CTDE with a centralised critic, opponent modelling, population-based self-play with an opponent buffer, stabilisation via parameter sharing.

Multi-agent credit assignment · High

In cooperative MARL it is hard to determine which agent contributed to the global reward. Naive methods that give every agent the same shared reward tend to produce lazy or free-rider behaviour.

Fix: Difference rewards, COMA (Counterfactual Multi-Agent Policy Gradients), value decomposition (VDN, QMIX, QTRAN), Shapley-value-based credit assignment.
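Difference rewards are the simplest of these fixes to show in code: each agent is credited with the team reward minus the team reward that would have been obtained without it. The sketch below assumes a hypothetical coverage task (team reward is the number of distinct targets covered) and uses "remove the agent" as the counterfactual. Other counterfactuals, such as substituting a default action, are equally valid choices.

```python
def team_reward(targets):
    """Global reward G: number of distinct targets covered by the team."""
    return len(set(targets))


def difference_rewards(targets):
    """D_i = G(a) - G(a without agent i): each agent's marginal contribution."""
    g = team_reward(targets)
    return [g - team_reward(targets[:i] + targets[i + 1:])
            for i in range(len(targets))]


# Three agents: two crowd onto target "A", one covers "B" alone.
choices = ["A", "A", "B"]
print(team_reward(choices))         # 2
print(difference_rewards(choices))  # [0, 0, 1]
```

With the shared reward every agent receives the same signal (2), so a redundant agent is rewarded as much as a pivotal one. The difference reward is nonzero only for the agent whose removal would lower the team reward.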

Combinatorial explosion of joint action space · High

Joint action space grows exponentially with N: |A|^N. For 10 agents with 10 actions each, that is already 10¹⁰ joint actions — not directly tractable.

Fix: Decentralised execution with local policies, value decomposition, mean-field approximation, factored Q-functions.
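The arithmetic, and the reason value decomposition helps, can be checked directly. The sketch below assumes an additive (VDN-style) factorisation Q_tot = Σ_i Q_i(a_i) with hypothetical per-agent Q-values for a single fixed state. Under that assumption the greedy joint action is found by N independent argmaxes instead of a search over |A|^N joint actions.

```python
from itertools import product


def joint_action_count(n_agents, n_actions):
    """Size of the joint action space |A|^N."""
    return n_actions ** n_agents


def greedy_joint_brute_force(q_tables):
    """Argmax over all |A|^N joint actions of Q_tot = sum_i Q_i(a_i)."""
    n_actions = len(q_tables[0])
    return max(product(range(n_actions), repeat=len(q_tables)),
               key=lambda joint: sum(q[a] for q, a in zip(q_tables, joint)))


def greedy_joint_factored(q_tables):
    """Same argmax using N per-agent argmaxes (N * |A| evaluations)."""
    return tuple(max(range(len(q)), key=lambda a: q[a]) for q in q_tables)


# Hypothetical per-agent Q-values for 3 agents with 3 actions each.
q_tables = [[0.1, 0.9, 0.3],
            [0.5, 0.2, 0.7],
            [1.0, -1.0, 0.0]]

print(joint_action_count(10, 10))          # 10000000000
print(greedy_joint_factored(q_tables))     # (1, 2, 0)
print(greedy_joint_brute_force(q_tables))  # (1, 2, 0)
```

QMIX relaxes the sum to any mixing function that is monotonic in each Q_i, which preserves the same property: per-agent argmaxes still recover the joint argmax.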

Reward shaping in mixed games · High

In general-sum games poorly chosen rewards lead to dominance of one agent, social dilemmas (Tragedy of the Commons) or reward hacking.

Fix: Mechanism design, opponent shaping (LOLA), inequity aversion, careful reward design and validation.

Failure to converge to equilibrium · High

In general-sum games there is no guarantee that gradient-based learning converges to a Nash equilibrium — cycles, drift, and exploitation loops are possible.

Fix: Population-based training, fictitious play, double oracle, league training (AlphaStar), entropy regularisation.
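The cycling behaviour is easy to reproduce even in the simplest setting. The sketch below is an illustration under stated assumptions, not a model of any particular algorithm: it uses an unconstrained two-player bilinear game (player 1 maximises x·y, player 2 maximises −x·y), which is zero-sum and therefore a special case of the general-sum setting, and plain simultaneous gradient ascent with a fixed step size.

```python
import math


def simultaneous_gradient_play(x=1.0, y=1.0, lr=0.1, steps=100):
    """Player 1 maximises x*y over x; player 2 maximises -x*y over y.

    The unique equilibrium is (0, 0). Returns the distance from it
    before the first step and after every step.
    """
    distances = [math.hypot(x, y)]
    for _ in range(steps):
        grad_x = y    # d(x*y)/dx
        grad_y = -x   # d(-x*y)/dy
        # Both players update at the same time from the same iterate.
        x, y = x + lr * grad_x, y + lr * grad_y
        distances.append(math.hypot(x, y))
    return distances


d = simultaneous_gradient_play()
print(d[0] < d[-1])  # True: the iterates spiral away from the equilibrium
```

Each update multiplies the distance from (0, 0) by sqrt(1 + lr²), so the players rotate around the equilibrium and drift outward instead of converging. Averaging over past strategies (as in fictitious play) is one way to damp this kind of rotation.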

Scalability to large N · Medium

Most Deep MARL algorithms are designed for N ≤ 10–20. Scaling to N > 100 requires approximations (mean-field, graph neural networks) and aggressive parameter sharing.

Fix: Mean-field MARL, graph-based architectures (GNN), hierarchical MARL, attention-based agents.
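The core idea of the mean-field approximation is that each agent conditions on the average action of its neighbours rather than on every neighbour's action individually. The sketch below shows only that input construction. The function names are hypothetical, actions are assumed to be discrete and one-hot encoded, and the state part of the Q-function input is omitted for brevity.

```python
def one_hot(action, n_actions):
    """One-hot encoding of a discrete action index."""
    return [1.0 if k == action else 0.0 for k in range(n_actions)]


def mean_action(neighbour_actions, n_actions):
    """Average of the neighbours' one-hot actions (an empirical distribution)."""
    if not neighbour_actions:
        return [0.0] * n_actions
    counts = [0.0] * n_actions
    for a in neighbour_actions:
        counts[a] += 1.0
    return [c / len(neighbour_actions) for c in counts]


def mean_field_q_input(own_action, neighbour_actions, n_actions):
    """Action part of the input to Q_i(s, a_i, mean_a): always 2 * |A| long."""
    return one_hot(own_action, n_actions) + mean_action(neighbour_actions, n_actions)


print(mean_action([0, 0, 1, 2], 3))                   # [0.5, 0.25, 0.25]
print(len(mean_field_q_input(0, [1, 2] * 500, 3)))    # 6
```

The input length stays at 2·|A| whether an agent has four neighbours or a thousand, which is what makes the approach usable for very large N.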

Evolution

Original paper · 1994 · ICML 1994 · Michael L. Littman

Markov Games as a Framework for Multi-Agent Reinforcement Learning

1928

Minimax theorem (von Neumann)

Inflection point

John von Neumann proves the minimax theorem for two-player zero-sum games — foundation of game theory and competitive MARL.

1950

Nash equilibrium

Inflection point

John Nash defines the concept of equilibrium in non-cooperative games, the central solution concept that MARL algorithms aim to learn.

1994

Littman: Markov Games framework for MARL

Inflection point

Michael Littman formally defines MARL as Markov Games and introduces the minimax-Q algorithm for zero-sum games.

Markov Games as a Framework for Multi-Agent Reinforcement Learning (paper)

2003

Hu & Wellman: Nash Q-learning

Generalisation of Q-learning to general-sum games with updates toward Nash equilibrium.

2017

MADDPG (OpenAI)

Inflection point

Lowe et al. introduce Multi-Agent DDPG with the CTDE paradigm and centralised critics — first widely adopted Deep MARL algorithm.

Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments (paper)

2018

OpenAI Five — Dota 2

Inflection point

OpenAI presents a team of 5 PPO agents that defeats professional Dota 2 players — a Deep MARL scale breakthrough.

2018

QMIX — Value Decomposition

Rashid et al. introduce QMIX with monotonic Q-function decomposition — standard for cooperative Deep MARL.

QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning (paper)

2019

AlphaStar — StarCraft II Grandmaster

Inflection point

DeepMind reaches Grandmaster level in StarCraft II — population-based self-play (league training) in competitive MARL with partial observability and a huge action space.

2021

MAPPO

Yu et al. show that PPO with minor modifications achieves competitive results in MARL — strong baseline for SMAC/MPE benchmarks.

The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games (paper)

2022

Cicero (Meta) — Diplomacy

Inflection point

Meta AI combines MARL, language models, and strategic planning — Cicero reaches top-player level in Diplomacy, a game requiring natural-language negotiation.