MMLU

2021 · Active · Published

Key innovation

First benchmark spanning 57 academic and professional domains, revealing that language models fail broadly on tasks requiring wide factual world knowledge despite impressive narrow-task performance.

Category

Evaluation

Abstraction level

Pattern

Operation level

Evaluation (runtime)

Use cases

LLM evaluation · general knowledge comparison · natural language understanding assessment · AI research

How it works

The benchmark consists of four-option multiple-choice questions grouped into 57 subject-specific tasks. Models are evaluated zero-shot or few-shot (typically with five worked examples per task): each prompt presents a question, and the model must select one answer letter (A, B, C, or D). Results are reported as the percentage of correct answers per task and as an average weighted by the number of questions in each task.
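
The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation script. The helper names (`format_question`, `build_prompt`, `score`) and the sample records are hypothetical. The header wording is modeled on the style used in the official repository and should be treated as an assumption.

```python
from collections import defaultdict

CHOICES = ["A", "B", "C", "D"]


def format_question(question, options, answer=None):
    """Render one item as text; few-shot examples also carry their answer letter."""
    lines = [question]
    for letter, option in zip(CHOICES, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:" + (f" {answer}" if answer is not None else ""))
    return "\n".join(lines)


def build_prompt(subject, shots, target):
    """Build a prompt for one target question.

    shots:  list of (question, options, answer) tuples; empty list = zero-shot.
    target: (question, options) tuple the model must answer.
    The header text is an assumed wording, not a guaranteed match to any script.
    """
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_question(q, o, a) for q, o, a in shots]
    blocks.append(format_question(*target))
    return header + "\n\n" + "\n\n".join(blocks)


def score(records):
    """Aggregate (task, predicted_letter, gold_letter) records.

    Returns per-task accuracy, the question-weighted average (every question
    counts equally), and the unweighted mean over tasks for comparison.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted == gold)
    per_task = {task: correct[task] / total[task] for task in total}
    weighted = sum(correct.values()) / sum(total.values())
    unweighted = sum(per_task.values()) / len(per_task)
    return per_task, weighted, unweighted


# Hypothetical predictions for two tasks of unequal size.
records = [
    ("astronomy", "A", "A"),
    ("astronomy", "B", "A"),
    ("astronomy", "C", "C"),
    ("astronomy", "D", "D"),
    ("law", "A", "B"),
]
per_task, weighted, unweighted = score(records)
```

With these sample records, the question-weighted average is 0.6 while the unweighted mean over tasks is 0.375, because the larger task dominates the weighted figure. The two numbers can diverge noticeably when task sizes differ, so it helps to state which one a reported score uses.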

Problem solved

The absence of a comprehensive benchmark spanning a wide spectrum of academic and professional domains, making it impossible to reliably compare general world knowledge and problem-solving ability across large language models.

Implementation

Reference implementations

MMLU – official repository

Implementation pitfalls

Benchmark saturation · High

Current frontier models score above 85-90% on MMLU, leaving too little headroom to differentiate top-tier models.

Fix: Use MMLU-Pro or GPQA for more challenging evaluations.

Training data contamination · High

MMLU questions may have appeared in model training data, inflating scores.

Fix: Cross-reference with benchmarks using new, unpublished questions (e.g. FrontierMath).

Evolution

Original paper · 2021 · ICLR 2021 · Dan Hendrycks

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt

2021

MMLU published (ICLR 2021)

Inflection point

Hendrycks et al. introduce the 57-task benchmark; most models score near the 25% random-chance baseline, and even the largest GPT-3 reaches only about 44% in the few-shot setting.

2022

GPT-3.5 and PaLM reach roughly 70%

Large models begin to clearly exceed human-level performance in some categories.

2023

GPT-4 reaches ~86%, MMLU loses discriminative power

Inflection point

Benchmark saturation leads to the creation of harder successors such as MMLU-Pro and GPQA.

Sources

Measuring Massive Multitask Language Understanding

MMLU Repository