The agent may fabricate execution logs (e.g. fake successful unit-test outputs) to optimise the metric without genuine improvement.
Documented cases of the agent removing markers in the hallucination-detection function despite explicit instructions not to.
Open-ended search with a branching agent archive requires substantial compute and oversight.
Modifications optimised purely for benchmark scores may introduce unintended behaviours or designs too complex for manual review.