Co-improvement

1990 · Active · Published: 17 May 2026 · Updated: 17 May 2026 · Published
Category
Other
Abstraction level
Pattern
Operation level
Post-training · Agent runtime · System
Use cases
Adversarial co-evolution of code and tests: code generator vs. test generator (Code-A1, BACE)
Co-evolution of policy and internal reward in LLM agents (Self-Guide)
Self-play with evolving task difficulty (G-Zero, SEIF, SAGE)
Multi-agent systems with simultaneous adaptation of agent capabilities and communication topology (TacoMAS)
Co-evolution of agent memory and retrieval mechanism (EvolveMem, Mem²Evolve)
Curriculum learning in which the simulation environment adapts to learner progress (SimWorld Studio)
Classical roots: predator-prey GAs (Hillis 1990), AlphaGo Zero self-play (2017)

How it works

At least two components are defined with asymmetric objectives (e.g. generator ↔ solver, policy ↔ reward, code ↔ test). Each has its own learning algorithm (RL, DPO, fine-tuning) and a loss that depends on the other component's current behavior. The training loop typically updates them in alternation, often with stabilizing mechanisms (replay buffer, anchoring on a minimal set of public examples, a restricted topology update rate) that guard against co-evolutionary drift and degenerate outcomes such as trivial challenges or self-collusion.
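The loop described above can be sketched with a deliberately small numeric toy. Everything in it is an illustrative assumption and not taken from any of the cited systems: the logistic success model, the names `success_prob` and `co_improve`, the learning rates, and the `max_delta` cap that stands in for a restricted update rate. The solver learns most from tasks at the edge of its ability, and the generator moves task difficulty toward a 50% solver success rate.

```python
import math
from typing import Tuple


def success_prob(skill: float, difficulty: float) -> float:
    """Toy logistic model: chance that the solver solves a task."""
    return 1.0 / (1.0 + math.exp(-(skill - difficulty)))


def co_improve(steps: int = 200, adapt_generator: bool = True,
               lr_solver: float = 0.5, lr_generator: float = 0.5,
               max_delta: float = 0.2) -> Tuple[float, float]:
    """Alternate solver and generator updates; return (skill, difficulty)."""
    skill, difficulty = 0.0, 0.0
    for step in range(steps):
        p = success_prob(skill, difficulty)
        if step % 2 == 0:
            # Solver turn: the learning signal p * (1 - p) peaks on tasks at
            # the edge of its ability and vanishes on tasks it always solves.
            skill += lr_solver * p * (1.0 - p)
        elif adapt_generator:
            # Generator turn: raise difficulty when the solver wins too often,
            # lower it when the solver mostly fails (target success rate 0.5).
            delta = lr_generator * (p - 0.5)
            # Stabilizer (assumed): cap the per-step change in difficulty.
            difficulty += max(-max_delta, min(max_delta, delta))
    return skill, difficulty


static_skill, _ = co_improve(adapt_generator=False)
coevolved_skill, coevolved_difficulty = co_improve(adapt_generator=True)
print(f"static tasks:      skill={static_skill:.2f}")
print(f"co-evolving tasks: skill={coevolved_skill:.2f}, "
      f"difficulty={coevolved_difficulty:.2f}")
```

With `adapt_generator=False` the task difficulty stays fixed, the solver's learning signal shrinks as it masters the task, and skill grows only slowly. With the generator enabled, difficulty tracks skill and the signal stays informative. The toy says nothing about real training dynamics; it only makes the coupling concrete.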

Problem solved

A single model trained on a static dataset quickly hits a ceiling: the data offers no training signal harder than what the model has already mastered. Co-evolution generates that signal from a second component that evolves in parallel.

Components

Components with asymmetric roles

At least two modules (models/agents/networks) with distinct objectives, e.g. generator and critic, code and test, policy and reward.

Coupled objective function

Each component's loss depends on the other's current behavior, so that improvement in one forces adaptation in the other.
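One way to write this coupling down is as a pair of interleaved gradient steps. The notation is an assumption introduced here for illustration: parameters $\theta_A, \theta_B$, losses $\mathcal{L}_A, \mathcal{L}_B$, and step sizes $\eta_A, \eta_B$.

```latex
\begin{aligned}
\theta_A^{(t+1)} &= \theta_A^{(t)} - \eta_A \,\nabla_{\theta_A}\,
    \mathcal{L}_A\!\left(\theta_A^{(t)};\ \theta_B^{(t)}\right) \\
\theta_B^{(t+1)} &= \theta_B^{(t)} - \eta_B \,\nabla_{\theta_B}\,
    \mathcal{L}_B\!\left(\theta_B^{(t)};\ \theta_A^{(t+1)}\right)
\end{aligned}
```

Each loss takes the other component's current parameters as a fixed argument, so an update to $\theta_A$ changes the landscape that $\theta_B$ is optimized on at the next step. The purely adversarial min-max game is the special case $\mathcal{L}_B = -\mathcal{L}_A$. The asymmetric objectives described above generally do not reduce to a single shared loss.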

Alternating training loop

Schedule of component updates: simultaneous, alternating, or at different time scales (e.g. the fast/slow loop in TacoMAS).
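The three schedule types can be expressed as a small generator that tells the training loop which component to update at each step. This is a minimal sketch: `update_schedule`, the mode names, and the `slow_every` parameter are hypothetical and do not reproduce the schedule of TacoMAS or any other cited system.

```python
from typing import Iterator, Tuple


def update_schedule(mode: str, steps: int,
                    slow_every: int = 4) -> Iterator[Tuple[bool, bool]]:
    """Yield (update_a, update_b) flags for each training step."""
    for step in range(steps):
        if mode == "simultaneous":
            # Both components take a step on every iteration.
            yield True, True
        elif mode == "alternating":
            # A updates on even steps, B on odd steps.
            yield step % 2 == 0, step % 2 == 1
        elif mode == "two_timescale":
            # A is the fast component; B moves only every `slow_every` steps.
            yield True, (step + 1) % slow_every == 0
        else:
            raise ValueError(f"unknown schedule mode: {mode!r}")


for flags in update_schedule("two_timescale", steps=4, slow_every=4):
    print(flags)
```

A training loop would consume the flags and call the matching optimizer step for each component. Keeping the schedule separate from the update logic makes it easy to compare schedules on the same pair of components.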

Stabilization mechanisms

Replay buffer (Mistake Book), anchoring on public examples (BACE), revert-on-regression, and exploration constraints all protect against co-evolutionary drift and self-collusion.
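Two of these mechanisms are simple enough to sketch generically. The names `ReplayBuffer` and `update_with_revert`, the dictionary-of-parameters representation, and the `tolerance` parameter are assumptions made for illustration. This is not the Mistake Book or the revert logic of any cited system.

```python
import copy
import random
from collections import deque
from typing import Any, Callable, Dict, List, Tuple


class ReplayBuffer:
    """Fixed-capacity store of past examples; the oldest are evicted first."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self._items: deque = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, item: Any) -> None:
        self._items.append(item)

    def sample(self, k: int) -> List[Any]:
        # Never ask for more items than the buffer currently holds.
        k = min(k, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self) -> int:
        return len(self._items)


def update_with_revert(params: Dict[str, float],
                       propose_update: Callable[[Dict[str, float]], Dict[str, float]],
                       evaluate_on_anchor: Callable[[Dict[str, float]], float],
                       tolerance: float = 0.0) -> Tuple[Dict[str, float], bool]:
    """Keep a candidate update only if the anchor score does not regress."""
    baseline = evaluate_on_anchor(params)
    # Work on a copy so the original parameters survive a rejected update.
    candidate = propose_update(copy.deepcopy(params))
    if evaluate_on_anchor(candidate) + tolerance < baseline:
        return params, False   # regression on the anchor set: revert
    return candidate, True     # no regression: accept
```

Here `evaluate_on_anchor` is meant to score the component on a small fixed set that the co-evolving partner cannot influence, so an update that only looks good against the partner is rejected. Mixing samples from the buffer into each training batch keeps earlier failure cases in view while the partner moves on.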

External verifier or asymmetric access

An independent ground-truth source (compiler, unit tests, environment reward) or a structural information asymmetry (e.g. a Checker without access to the Solver in MARCH) is the foundation of an honest signal.
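For the code ↔ test case, execution against a handful of trusted public examples can play the role of the external verifier. The sketch below is one possible scoring rule under stated assumptions, not the method of Code-A1 or BACE. The names `passes` and `score_round`, the integer-function task, and the `4 * rate * (1 - rate)` test reward are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[int, int]        # (input, expected output)
Code = Callable[[int], int]      # a candidate solution


def passes(code: Code, example: Example) -> bool:
    """Run one candidate on one example; any exception counts as a failure."""
    x, expected = example
    try:
        return code(x) == expected
    except Exception:
        return False


def score_round(codes: Sequence[Code], generated_tests: Sequence[Example],
                public_tests: Sequence[Example]) -> Tuple[List[float], List[float]]:
    """Score code candidates and generated tests, anchored on public tests."""
    # External anchor: a candidate is trusted only if it passes every public test.
    anchored = [all(passes(c, t) for t in public_tests) for c in codes]
    code_scores = []
    for code, ok in zip(codes, anchored):
        if ok and generated_tests:
            hits = sum(passes(code, t) for t in generated_tests)
            code_scores.append(hits / len(generated_tests))
        else:
            code_scores.append(0.0)
    trusted = [c for c, ok in zip(codes, anchored) if ok]
    test_scores = []
    for test in generated_tests:
        if not trusted:
            test_scores.append(0.0)
            continue
        rate = sum(passes(c, test) for c in trusted) / len(trusted)
        # Reward tests that split the trusted candidates; a test that all
        # pass (trivial) or all fail (possibly wrong) earns nothing.
        test_scores.append(4.0 * rate * (1.0 - rate))
    return code_scores, test_scores


# Toy task: square an integer.
candidates = [lambda x: x * x,        # correct
              lambda x: abs(x) * x,   # wrong sign for negative inputs
              lambda x: x + x]        # fails the public example (3, 9)
code_scores, test_scores = score_round(
    candidates,
    generated_tests=[(2, 4), (-3, 9), (5, 26)],
    public_tests=[(2, 4), (3, 9)])
print(code_scores, test_scores)
```

In this run only the test `(-3, 9)` earns a reward, because it is the only one that separates the two candidates that pass the public examples. The wrong test `(5, 26)` earns nothing, but it still lowers the correct candidate's score, which shows why a stronger verifier than a few public examples is desirable.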

Implementation

Implementation pitfalls
Co-evolutionary drift · High

Components can drift away from external reality and mutually optimize trivial or pathological signals (e.g. challenges that neither component can solve).

Self-collusion · High

In white-box setups (one model generates both code and tests) the components can "collude": the tests become trivially satisfiable. Mitigation: model separation or information asymmetry (MARCH).

Cross-component reward hacking · High

Component A may discover ways to maximize signal from B without actually solving the task — particularly risky when B is a weak proxy for external truth.
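A simple monitor for this failure mode compares the signal A receives from B with an externally verified score on the same checkpoints, and raises a flag when the two pull apart. The sketch assumes such a verified score is available at all. The function names, the window size, and the threshold are hypothetical choices.

```python
from statistics import mean
from typing import Sequence


def divergence(proxy: Sequence[float], verified: Sequence[float],
               window: int = 3) -> float:
    """Growth of the (proxy - verified) gap from the first to the last window."""
    if len(proxy) != len(verified):
        raise ValueError("proxy and verified must have the same length")
    if len(proxy) < 2 * window:
        raise ValueError("need at least 2 * window checkpoints")
    gaps = [p - v for p, v in zip(proxy, verified)]
    return mean(gaps[-window:]) - mean(gaps[:window])


def flag_reward_hacking(proxy: Sequence[float], verified: Sequence[float],
                        window: int = 3, threshold: float = 0.1) -> bool:
    """True when B's signal rises clearly faster than the verified score."""
    return divergence(proxy, verified, window) > threshold


# Signal from component B keeps rising while the verified score stalls.
proxy_scores = [0.50, 0.55, 0.60, 0.80, 0.90, 0.95]
verified_scores = [0.50, 0.54, 0.58, 0.58, 0.57, 0.55]
print(flag_reward_hacking(proxy_scores, verified_scores))
```

A flag is a prompt for inspection, not proof of hacking: the gap can also widen because the verified set is too small or too easy. Pairing the flag with a revert or a pause of A's updates is one plausible response.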

Training instability · Medium

Simultaneous training of multiple components with differing gradients often diverges; fast/slow schedules (TacoMAS), revert-on-regression (EvolveMem), or anchoring (BACE) are needed.

Lack of external ground truth · High

In open-ended domains co-evolution without a verifier leads to echo chambers; intrinsic rewards (Hint-δ in G-Zero) or structural asymmetry help.

Evolution

Original paper · 1990 · W. Daniel Hillis
Co-evolving Parasites Improve Simulated Evolution as an Optimization Procedure
W. D. Hillis: a co-evolutionary genetic algorithm with a predator-prey (host-parasite) relation evolves sorting networks more effectively than a classical GA. Conceptual start of co-evolution in computation.
Inflection point
2014
Generative Adversarial Networks (Goodfellow et al., NeurIPS 2014) — generator and discriminator co-evolve in a min-max game; flagship adversarial co-evolution example in deep learning.
Inflection point
2017
AlphaGo Zero (Silver et al., Nature): pure self-play as a form of co-evolution with oneself; surpasses human-level play without human expert data.
Inflection point
2017
Population-Based Training (Jaderberg et al., DeepMind) — a population of agents co-evolves hyperparameters and weights.
2020
POET / Enhanced POET (Wang et al., Uber AI) — environment and agent grow together; explicit agent ↔ task co-evolution.
2025
Surge in LLM co-evolution papers: Code-A1, BACE, Self-Guide, G-Zero, SEIF, TacoMAS, Mem²Evolve, EvolveMem — emergence as a standard pattern in LLM agent self-improvement.
Inflection point
2026
BACE (GECCO 2026) and Mem²Evolve (ACL 2026) — co-evolution enters mainstream NLP and evolutionary-computation venues.

Execution paradigm

Primary mode
mixture
Activation pattern
stage_dependent

Parallelism

Parallelism level
partially_parallel
Scope
training · across_devices

Hardware requirements

Primary
Good fit