Expert mathematicians write original problems that do not appear in existing published material. Each problem has a definite, verifiable answer (a number, a formula, or another mathematical object), and submitted answers are checked automatically using a Python/Mathematica interpreter. The problems are kept unpublished so that AI models cannot have seen them before they are evaluated.
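The automatic check described above can be illustrated with a minimal sketch. It assumes the answer is a single exact rational number and uses only the Python standard library. The names `parse_answer` and `verify` are hypothetical and are not taken from the benchmark's actual harness, which may work quite differently.

```python
from fractions import Fraction


def parse_answer(text):
    """Parse a submitted answer string into an exact rational number.

    Hypothetical helper: accepts forms such as "3/4", "0.75" or "12".
    """
    return Fraction(text.strip())


def verify(submitted, reference):
    """Return True only if the submitted answer exactly equals the reference.

    Exact rational arithmetic avoids floating-point tolerance issues.
    Malformed submissions are scored as incorrect instead of raising.
    """
    try:
        return parse_answer(submitted) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        return False


if __name__ == "__main__":
    # Illustrative values only, not real benchmark problems.
    print(verify("3/4", Fraction(3, 4)))
    print(verify("not a number", 1))
```

Exact comparison is the design choice that matters here: an answer either matches the reference or it does not, so no human grading is needed. Formulas and other mathematical objects would require a symbolic equality check, which this sketch does not attempt.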
Saturation of existing mathematical benchmarks (e.g. MATH, AMC) by frontier models; absence of a reliable measure of the distance between AI capabilities and those of contemporary research mathematicians.
To prevent data contamination, FrontierMath does not release its problems publicly, so evaluation requires controlled access.
Glazer et al. from Epoch AI introduce the FrontierMath research-mathematics benchmark; frontier AI models solve fewer than 2% of the problems.
Math benchmark independent of hardware; verification via Python interpreter.