Co-improvement
How it works
At least two components are defined with asymmetric objectives (e.g. generator ↔ solver, policy ↔ reward, code ↔ test). Each has its own learning algorithm (RL, DPO, fine-tuning) and a loss that depends on the other component. The training loop updates them alternately, often with stabilizing mechanisms (a replay buffer, anchoring on a minimal set of public examples, a restricted rate of topology updates) that prevent co-evolutionary drift and degeneration (trivial challenges, self-collusion).
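A minimal sketch of such an alternating loop, assuming a toy generator ↔ solver pair reduced to two scalars. The logistic solve rate, the update rules, and all names (`success_prob`, `solver_step`, `generator_step`, `co_improve`, `max_step`) are illustrative assumptions, not taken from any of the systems named in this note.

```python
import math


def success_prob(skill: float, difficulty: float) -> float:
    """Toy solve rate: logistic in the gap between skill and difficulty."""
    return 1.0 / (1.0 + math.exp(difficulty - skill))


def solver_step(skill: float, difficulty: float, lr: float = 0.5) -> float:
    # The solver's signal depends on the generator: p * (1 - p) is largest for
    # tasks that are neither trivial nor impossible and vanishes at both ends.
    p = success_prob(skill, difficulty)
    return skill + lr * p * (1.0 - p)


def generator_step(skill: float, difficulty: float,
                   lr: float = 0.5, max_step: float = 0.2) -> float:
    # The generator's signal depends on the solver: raise difficulty when the
    # solver wins more than half the time, lower it otherwise.
    p = success_prob(skill, difficulty)
    step = lr * (p - 0.5)
    # Assumed stabilizer: clip the step so the generator cannot outrun the solver.
    step = max(-max_step, min(max_step, step))
    return difficulty + step


def co_improve(rounds: int = 200, freeze_generator: bool = False):
    """Alternate updates: solver on even rounds, generator on odd rounds."""
    skill, difficulty = 0.0, 0.0
    for t in range(rounds):
        if t % 2 == 0:
            skill = solver_step(skill, difficulty)
        elif not freeze_generator:
            difficulty = generator_step(skill, difficulty)
    return skill, difficulty
```

Calling `co_improve(freeze_generator=True)` keeps the difficulty static. In this toy the solver's signal then decays as its skill grows, whereas the alternating run keeps the solve rate away from both 0 and 1, so the solver keeps receiving a usable signal.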
Problem solved
A single model trained on a static dataset quickly hits a ceiling: it receives no training signal harder than what it has already mastered. Co-evolution generates that signal from a second component that evolves in parallel.
Components
At least two modules (models/agents/networks) with distinct objectives, e.g. generator and critic, code and test, policy and reward.
Each component's loss depends on the other's current behavior, so that improvement in one forces adaptation in the other.
An update schedule for the components: simultaneous, alternating, or at different time scales (e.g. the fast/slow loop in TacoMAS).
Stabilizers that protect against co-evolutionary drift and self-collusion: a replay buffer (Mistake Book), anchoring on public examples (BACE), revert-on-regression, and exploration constraints.
An independent ground-truth source (compiler, unit tests, environment reward) or a structural information asymmetry (e.g. a Checker without access to the Solver in MARCH), which keeps the training signal honest.
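To make the role of the ground-truth source concrete, here is a small sketch for a code ↔ test pair. It assumes a trusted reference function is available as the oracle, which is a strong assumption; in practice the ground truth may cover only a few public examples. All names (`reference_abs`, `filter_tests`, `pass_rate`, the candidates, and the generated tests) are hypothetical.

```python
from typing import Callable, List, Tuple

Test = Tuple[int, int]  # (input, expected output)


def reference_abs(x: int) -> int:
    """Assumed trusted oracle; stands in for a compiler, unit tests, or environment."""
    return abs(x)


def filter_tests(tests: List[Test], oracle: Callable[[int], int]) -> List[Test]:
    """Keep only generated tests whose expected value agrees with the oracle."""
    return [(x, y) for x, y in tests if oracle(x) == y]


def pass_rate(candidate: Callable[[int], int], tests: List[Test]) -> float:
    """Fraction of tests the candidate passes (0.0 when there are no tests)."""
    if not tests:
        return 0.0
    return sum(candidate(x) == y for x, y in tests) / len(tests)


# Hypothetical outputs of the "code" component.
def buggy(x: int) -> int:
    return x  # wrong for negative inputs


def correct(x: int) -> int:
    return -x if x < 0 else x


# Hypothetical outputs of the "test" component; (-3, -3) encodes the same
# mistake as `buggy`, i.e. a colluding test.
generated: List[Test] = [(2, 2), (-3, -3), (-5, 5), (0, 0)]
trusted = filter_tests(generated, reference_abs)
```

On the unfiltered tests `buggy` and `correct` receive the same pass rate, so the test component provides no pressure toward correctness. After filtering against the oracle, only `correct` passes everything.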
Implementation pitfalls
Components can drift away from external reality and mutually optimize trivial or pathological signals (e.g. challenges unsolvable for both).
In white-box setups, where one model generates both code and tests, the components can "collude": the tests become trivially satisfiable. Mitigation: model separation, information asymmetry (MARCH).
Component A may discover ways to maximize signal from B without actually solving the task — particularly risky when B is a weak proxy for external truth.
Simultaneous training of multiple components with differing gradients often diverges; fast/slow schedules (TacoMAS), revert-on-regression (EvolveMem), or anchoring (BACE) are needed.
In open-ended domains co-evolution without a verifier leads to echo chambers; intrinsic rewards (Hint-δ in G-Zero) or structural asymmetry help.
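One of the mitigations above, revert-on-regression, can be sketched as a guard around each update: a candidate is kept only if an external score does not get worse. This is an illustrative sketch, not EvolveMem's implementation; `train_with_revert`, `propose_update`, `external_score`, and the scripted `steps` are hypothetical stand-ins.

```python
import copy
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


def train_with_revert(
    params: Params,
    propose_update: Callable[[Params], Params],
    external_score: Callable[[Params], float],
    rounds: int,
) -> Tuple[Params, float, List[str]]:
    """Keep an update only if the external score does not regress."""
    best = copy.deepcopy(params)
    best_score = external_score(best)
    log: List[str] = []
    for _ in range(rounds):
        # Work on a copy so a rejected candidate leaves `best` untouched.
        candidate = propose_update(copy.deepcopy(best))
        score = external_score(candidate)
        if score >= best_score:
            best, best_score = candidate, score
            log.append("accept")
        else:
            log.append("revert")
    return best, best_score, log


# Hypothetical stand-ins: a scripted sequence of updates, some of them harmful,
# and an external score that peaks at w = 3.
steps = iter([1.0, 1.0, 5.0, 1.0, -4.0, 0.5])


def propose_update(p: Params) -> Params:
    p["w"] += next(steps)
    return p


def external_score(p: Params) -> float:
    return -(p["w"] - 3.0) ** 2


final, score, log = train_with_revert(
    {"w": 0.0}, propose_update, external_score, rounds=6
)
```

The guard is only as good as `external_score`: if that score is itself produced by the co-evolving partner rather than by an independent source, reverting protects against noise but not against collusion.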