Robots Atlas

MMLU-Pro

An extended version of MMLU that eliminates trivial and noisy questions, expands the answer choices from 4 to 10, and enriches the dataset with reasoning-focused questions (not just knowledge recall), dropping model accuracy by 16-33% and restoring the benchmark's discriminative power.

Category
Abstraction level
Operation level
frontier LLM evaluation · multi-step reasoning testing · frontier model comparison · CoT assessment

The dataset extends MMLU by (1) removing trivial and noisy questions and integrating harder ones from external sources; (2) expanding the options to 10 per question; (3) adding questions that require multi-step reasoning. Models are evaluated both zero-shot and with chain-of-thought (CoT) prompting; results show CoT is more effective on MMLU-Pro than on the original MMLU.
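As a rough illustration of the evaluation setup, here is a minimal sketch that loads the dataset and renders one question as a zero-shot CoT prompt with its lettered options. It assumes the TIGER-Lab/MMLU-Pro release on the Hugging Face Hub and its question/options/category field names; adjust if the schema differs.

```python
# Minimal sketch: build a zero-shot CoT prompt for one MMLU-Pro item.
# Assumes the TIGER-Lab/MMLU-Pro release on the Hugging Face Hub and its
# "question" / "options" / "category" fields (adjust if the schema differs).
from datasets import load_dataset
import string

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

def build_prompt(item: dict) -> str:
    # Up to 10 options are lettered A-J (MMLU-Pro expands MMLU's 4 choices).
    letters = string.ascii_uppercase
    choices = "\n".join(
        f"{letters[i]}. {opt}" for i, opt in enumerate(item["options"])
    )
    return (
        f"The following is a multiple-choice question about {item['category']}.\n"
        "Think step by step, then finish with 'The answer is (X)'.\n\n"
        f"Question: {item['question']}\n{choices}\nAnswer:"
    )

print(build_prompt(ds[0]))
```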

Saturation of the original MMLU by frontier models (scores above 85-90%) and its sensitivity to prompt variations, which make it hard to distinguish the capabilities of top models.

Common pitfalls

10 choices increase token cost in few-shot
LOW

A prompt with 10 answer choices is longer, which increases evaluation cost in few-shot setups with long exemplars.

Use zero-shot CoT or a reduced few-shot setup (1-3 examples); see the sketch below.
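A minimal sketch of this mitigation: a capped few-shot prompt builder where k exemplars (1-3) keep the prompt short and k = 0 degenerates to a plain zero-shot prompt. Item dicts with question/options/answer fields are assumed, as in the sketch above; render and few_shot_prompt are illustrative names, not part of any official harness.

```python
import string

def render(item: dict, with_answer: bool) -> str:
    # One MMLU-Pro item rendered with up to 10 lettered choices (A-J).
    letters = string.ascii_uppercase
    choices = "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(item["options"]))
    tail = f" {item['answer']}" if with_answer else ""
    return f"Question: {item['question']}\n{choices}\nAnswer:{tail}"

def few_shot_prompt(exemplars: list, target: dict, k: int = 1) -> str:
    # Each exemplar carries its 10 choices, so even k = 3 adds many tokens;
    # k = 0 falls back to a pure zero-shot prompt for the target question.
    shots = [render(e, with_answer=True) for e in exemplars[:k]]
    return "\n\n".join(shots + [render(target, with_answer=False)])
```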

GENESIS · Source paper

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
2024 · NeurIPS 2024 (Datasets and Benchmarks Track, Spotlight) · Yubo Wang, Xueguang Ma, Ge Zhang et al.
2024

MMLU-Pro published (June 2024, NeurIPS 2024 Spotlight)

breakthrough

Wang et al. publish the enhanced MMLU with 10 choices and reasoning questions; model scores drop 16-33%.

Hardware agnostic · PRIMARY

Text-based benchmark independent of evaluation hardware.

EXTENDS

MMLU

MMLU (Massive Multitask Language Understanding) is a benchmark proposed by Hendrycks et al. in 2021, covering 57 domains ranging from elementary mathematics, US history, and computer science to law, medicine, and ethics. The dataset contains over 14,000 multiple-choice questions drawn from academic and professional exams. MMLU became the de facto standard for measuring general knowledge and reasoning capabilities of LLMs; it revealed that early models barely exceeded random chance accuracy (25%), while 2023-2024 models achieve above 85-90%, leading to harder successors (MMLU-Pro, GPQA).


Commonly used with

MMLU

The original benchmark that MMLU-Pro extends; see the full description under EXTENDS above.

GPQA

GPQA (Graduate-Level Google-Proof Q&A Benchmark) is a benchmark developed by Rein et al. (2023) containing 448 multiple-choice questions written by domain experts holding or pursuing PhDs in biology, physics, and chemistry. Questions are designed to be "Google-proof" – highly skilled non-expert validators reached only 34% accuracy after 30 minutes of unrestricted web search, while domain experts reached 65% (74% after discounting clear retrospective mistakes). The strongest GPT-4 configuration achieved 39% in the original paper. GPQA is used as a measure of model capabilities for scalable oversight tasks, where AI may surpass the skills of its human supervisors.
