Co-improvement

1990 · Active · Published: 17 May 2026 · Updated: 17 May 2026

How it works

At least two components are defined with asymmetric objectives (e.g. generator ↔ solver, policy ↔ reward, code ↔ test). Each has its own learning algorithm (RL, DPO, fine-tuning) and a loss that depends on the other. The training loop updates them in alternation, often with stabilizing mechanisms (a replay buffer, anchoring on a minimal set of public examples, a restricted topology update rate) that guard against co-evolutionary drift and degeneration (trivial challenges, self-collusion).
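As a rough illustration of this loop, the toy sketch below couples a generator and a solver through a single success probability. Everything in it is an assumption made for illustration: the names `success_prob` and `co_train`, the logistic success model, the 50% target success rate, and the scalar "skill" and "difficulty" that stand in for real models. It is not taken from any of the systems named in this entry.

```python
import math
from typing import List, Tuple


def success_prob(skill: float, difficulty: float) -> float:
    """Toy model: chance that the solver solves a challenge (logistic in the gap)."""
    return 1.0 / (1.0 + math.exp(-(skill - difficulty)))


def co_train(steps: int = 200, lr: float = 0.5, target: float = 0.5,
             adapt_generator: bool = True) -> List[Tuple[float, float]]:
    """Alternate generator and solver updates; return (skill, difficulty) per step."""
    skill, difficulty = 0.0, 0.0
    history = []
    for step in range(steps):
        p = success_prob(skill, difficulty)
        if step % 2 == 0:
            # Generator turn: its loss depends on the solver's current success
            # rate, so it nudges difficulty until that rate sits near `target`.
            if adapt_generator:
                difficulty += lr * (p - target)
        else:
            # Solver turn: the learning signal p * (1 - p) is largest when the
            # outcome is uncertain and vanishes once challenges are too easy.
            skill += lr * p * (1.0 - p)
        history.append((skill, difficulty))
    return history


if __name__ == "__main__":
    co_skill, co_difficulty = co_train()[-1]
    static_skill, _ = co_train(adapt_generator=False)[-1]
    print(f"co-evolving: skill={co_skill:.2f}, difficulty={co_difficulty:.2f}")
    print(f"static:      skill={static_skill:.2f}")
```

With `adapt_generator=False` the difficulty stays fixed, the solver's signal fades as it masters the fixed challenges, and its final skill should end up lower than in the co-evolving run.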

Problem solved

A single model trained on a static dataset quickly hits a ceiling: the data offers no training signal harder than what the model already masters. Co-evolution generates that signal from a second component that evolves in parallel.

Components

Components with asymmetric roles

At least two modules (models/agents/networks) with distinct objectives, e.g. generator and critic, code and test, policy and reward.

Coupled objective function

Each component's loss depends on the other's current behavior, so that improvement in one forces adaptation in the other.
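One way to write this coupling down is sketched below. The notation is assumed for illustration and is not the formulation of any particular paper: $\theta_A$ and $\theta_B$ are the parameters of the two components, $L_A$ and $L_B$ their losses, and $\eta_A$, $\eta_B$ their step sizes, with A updated first in each round.

```latex
\begin{aligned}
\theta_A^{(t+1)} &= \theta_A^{(t)} - \eta_A \,\nabla_{\theta_A} L_A\bigl(\theta_A^{(t)}, \theta_B^{(t)}\bigr), \\
\theta_B^{(t+1)} &= \theta_B^{(t)} - \eta_B \,\nabla_{\theta_B} L_B\bigl(\theta_A^{(t+1)}, \theta_B^{(t)}\bigr).
\end{aligned}
```

Because each loss takes both parameter sets as arguments, a step by A changes the landscape that B descends next. In the zero-sum special case $L_B = -L_A$ this reduces to alternating gradient descent-ascent on a min-max game; with distinct losses the components are coupled without being strictly adversarial.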

Alternating training loop

The schedule of component updates: simultaneous, alternating, or at different time scales (e.g. the fast/slow loop in TacoMAS).
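The three schedule types can be sketched as a small generator that says which component to update at each step. The function name `update_schedule`, the mode strings, and the `slow_every` parameter are hypothetical; the two-timescale branch is a generic fast/slow scheme and is not claimed to match TacoMAS.

```python
from typing import Iterator, Tuple


def update_schedule(mode: str, steps: int,
                    slow_every: int = 4) -> Iterator[Tuple[bool, bool]]:
    """Yield (update_a, update_b) flags for each training step."""
    for step in range(steps):
        if mode == "simultaneous":
            yield True, True
        elif mode == "alternating":
            # A updates on even steps, B on odd steps.
            yield step % 2 == 0, step % 2 == 1
        elif mode == "two_timescale":
            # A is the fast component; B only updates every `slow_every` steps.
            yield True, step % slow_every == slow_every - 1
        else:
            raise ValueError(f"unknown schedule mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("simultaneous", "alternating", "two_timescale"):
        print(mode, list(update_schedule(mode, steps=4)))
```

Keeping the schedule separate from the update logic makes it cheap to try a slower rate for whichever component destabilizes training.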

Stabilization mechanisms

A replay buffer (Mistake Book), anchoring on public examples (BACE), revert-on-regression, and exploration constraints protect against co-evolutionary drift and self-collusion.
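Two of these mechanisms are simple enough to sketch generically. The class `ReplayBuffer` and the function `update_with_revert` below are hypothetical names and only illustrate the general ideas of replay and revert-on-regression; they are not the Mistake Book, BACE, or EvolveMem implementations. The `evaluate` callback is assumed to score parameters on a fixed anchor set, higher being better.

```python
import copy
import random
from collections import deque
from typing import Any, Callable, Deque, List, Tuple


class ReplayBuffer:
    """Fixed-capacity store of past examples (e.g. earlier failures) to re-train on."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self._items: Deque[Any] = deque(maxlen=capacity)  # oldest items fall out
        self._rng = random.Random(seed)

    def add(self, item: Any) -> None:
        self._items.append(item)

    def sample(self, k: int) -> List[Any]:
        """Return up to k stored items, drawn without replacement."""
        return self._rng.sample(list(self._items), min(k, len(self._items)))

    def __len__(self) -> int:
        return len(self._items)


def update_with_revert(params: Any,
                       propose_update: Callable[[Any], Any],
                       evaluate: Callable[[Any], float],
                       tolerance: float = 0.0) -> Tuple[Any, bool]:
    """Keep an update only if the anchor-set score does not regress.

    Returns (params_to_keep, accepted).
    """
    baseline = evaluate(params)
    candidate = propose_update(copy.deepcopy(params))
    if evaluate(candidate) < baseline - tolerance:
        return params, False  # regression: discard the candidate
    return candidate, True
```

Scoring against a fixed anchor set rather than against the partner component is what lets the revert check catch drift that both components share.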

External verifier or asymmetric access

An independent ground-truth source (compiler, unit tests, environment reward) or a structural information asymmetry (e.g. a Checker without access to the Solver in MARCH). Either one is the foundation of an honest signal.
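A minimal sketch of an external verifier, assuming a sorting task where Python's built-in `sorted` plays the role of ground truth. The function name `verify` and the task are illustrative choices. The point is only that the score comes from a trusted reference and not from the other learned component.

```python
from typing import Callable, Iterable, List


def verify(candidate: Callable[[List[int]], List[int]],
           cases: Iterable[List[int]]) -> float:
    """Return the fraction of cases where `candidate` matches the trusted reference."""
    case_list = list(cases)
    if not case_list:
        return 0.0
    passed = 0
    for case in case_list:
        try:
            # Pass a copy so a candidate that mutates its input cannot
            # corrupt the reference comparison.
            ok = candidate(list(case)) == sorted(case)
        except Exception:
            ok = False  # a crashing candidate simply fails the case
        passed += ok
    return passed / len(case_list)


if __name__ == "__main__":
    print(verify(sorted, [[3, 1, 2], [], [5, 4]]))
    print(verify(lambda xs: xs, [[1, 2], [2, 1]]))
```

In a code ↔ test setup the generated tests can still supply the hard cases, while a check like this decides whether a pass counts.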

Implementation

Reference implementations

EvolveMem (SimpleMem)

Implementation pitfalls

Co-evolutionary drift · High

Components can drift away from external reality and mutually optimize trivial or pathological signals (e.g. challenges unsolvable for both).

Self-collusion · High

In white-box setups (one model generates both code and tests) the components can "collude": the tests become trivially satisfiable. Mitigations: model separation and information asymmetry (MARCH).

Cross-component reward hacking · High

Component A may discover ways to maximize signal from B without actually solving the task — particularly risky when B is a weak proxy for external truth.
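One hedged way to watch for this is to log both the proxy score that B assigns to A and an external ground-truth score, then flag the first step where the proxy keeps rising while the external score has stalled. The function `first_divergence`, its `window` and `min_gap` parameters, and the thresholds are assumptions for illustration, not a method from the systems named in this entry.

```python
from typing import Optional, Sequence


def first_divergence(proxy: Sequence[float], external: Sequence[float],
                     window: int = 3, min_gap: float = 0.1) -> Optional[int]:
    """Return the first step where the proxy gained over `window` steps
    while the external score did not, or None if no such step exists."""
    if len(proxy) != len(external):
        raise ValueError("score histories must have the same length")
    for t in range(window, len(proxy)):
        proxy_gain = proxy[t] - proxy[t - window]
        external_gain = external[t] - external[t - window]
        # Suspicious: the external score is flat or falling, yet the proxy
        # has pulled ahead of it by more than `min_gap`.
        if external_gain <= 0 and proxy_gain - external_gain > min_gap:
            return t
    return None


if __name__ == "__main__":
    proxy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    external_scores = [0.1, 0.2, 0.3, 0.3, 0.3, 0.3]
    print(first_divergence(proxy_scores, external_scores))
```

A flagged step is only a prompt to inspect samples by hand; a plateau in the external score can also mean the task is saturated.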

Training instability · Medium

Simultaneous training of multiple components with differing gradients often diverges; fast/slow schedules (TacoMAS), revert-on-regression (EvolveMem), or anchoring (BACE) are needed.

Lack of external ground truth · High

In open-ended domains co-evolution without a verifier leads to echo chambers; intrinsic rewards (Hint-δ in G-Zero) or structural asymmetry help.

Evolution

Original paper · 1990 · W. Daniel Hillis

Co-evolving Parasites Improve Simulated Evolution as an Optimization Procedure

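W. D. Hillis: a co-evolutionary genetic algorithm in which test cases evolve as parasites against candidate sorting networks (a host-parasite, predator-prey-like relation) solves the sorting problem more efficiently than a classical GA. Conceptual start of co-evolution in computation.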

Inflection point

2014

Generative Adversarial Networks (Goodfellow et al., NeurIPS 2014) — generator and discriminator co-evolve in a min-max game; flagship adversarial co-evolution example in deep learning.

Inflection point

2017

AlphaGo Zero (Silver et al., Nature): pure self-play as a form of co-evolution with oneself; it surpasses human-level play without expert data.

Inflection point

2017

Population-Based Training (Jaderberg et al., DeepMind) — a population of agents co-evolves hyperparameters and weights.

2020

POET (2019) and Enhanced POET (2020) (Wang et al., Uber AI): environment and agent grow together in an explicit agent ↔ task co-evolution.

2025

A surge of LLM co-evolution papers (Code-A1, BACE, Self-Guide, G-Zero, SEIF, TacoMAS, Mem²Evolve, EvolveMem) marks the pattern's emergence as a standard approach to LLM agent self-improvement.

Inflection point

2026

BACE (GECCO 2026) and Mem²Evolve (ACL 2026) — co-evolution enters mainstream NLP and evolutionary-computation venues.