
FrontierMath

Expert-level mathematics benchmark of original, unpublished problems created by research mathematicians. Current frontier AI models solve under 2% of the problems, revealing a vast gap between AI capabilities and the expertise of the research mathematics community.

Category
Abstraction level
Operation level

evaluation of advanced AI mathematical capabilities · measuring AI vs expert boundary · mathematical reasoning research · safety research (superhuman reasoning)

Expert mathematicians create original problems outside the scope of existing published materials. Each problem has a verifiable answer (a number, a formula, or another mathematical object). Results are checked automatically using a Python/Mathematica interpreter. Problems are withheld from public release and shown to AI models only at evaluation time, to prevent training-data contamination.
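To make the automatic checking concrete, here is a minimal sketch of how a benchmark with exact, machine-checkable answers could score submissions in a Python interpreter, assuming SymPy for symbolic comparison. The `Problem` type and `verify_submission` helper are hypothetical illustration names; FrontierMath's actual harness is not public.

```python
# Minimal sketch of automated answer verification for a benchmark whose
# problems have exact, machine-checkable answers. The Problem type and
# verify_submission helper are hypothetical, not FrontierMath's own code.
from dataclasses import dataclass

import sympy as sp


@dataclass
class Problem:
    statement: str
    ground_truth: str  # canonical answer, as a SymPy-parsable expression


def verify_submission(problem: Problem, submitted: str) -> bool:
    """Score a submitted answer by exact symbolic equality with the key."""
    # Note: sympify evaluates strings; a production harness would run
    # this inside a sandboxed interpreter.
    try:
        expected = sp.sympify(problem.ground_truth)
        answer = sp.sympify(submitted)
    except (sp.SympifyError, SyntaxError, TypeError):
        return False  # unparsable submissions score zero
    # Symbolic comparison accepts equivalent closed forms, e.g.
    # "2*cos(pi/6)" and "sqrt(3)", while rejecting decimal approximations.
    return sp.simplify(expected - answer) == 0


# Equivalent exact forms verify; a near-miss decimal does not.
p = Problem(statement="...", ground_truth="sqrt(3)")
assert verify_submission(p, "2*cos(pi/6)")
assert not verify_submission(p, "1.7320508")
```

Checking exact symbolic equality rather than numeric closeness is what lets the benchmark accept any correct closed form while never rewarding an approximation that happens to be near the answer.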

Saturation of existing mathematical benchmarks (e.g. MATH, AMC) by frontier models; absence of a reliable measure of the distance between AI capabilities and those of contemporary research mathematicians.

Common pitfalls

Dataset not fully public
MEDIUM

FrontierMath does not release questions publicly to prevent contamination, requiring controlled access for evaluation.

Contact the authors to obtain evaluation access.

GENESIS · Source paper

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
arXiv 2024 · Elliot Glazer, Ege Erdil, Tamay Besiroglu et al.

FrontierMath published (arXiv, November 2024)

breakthrough

Glazer et al. from Epoch AI introduce the research-level mathematics benchmark; frontier AI models solve fewer than 2% of its problems.

Hardware agnostic · PRIMARY

Math benchmark independent of hardware; verification via Python interpreter.