The benchmark consists of four-option multiple-choice questions grouped into 57 thematic tasks. Models are evaluated in zero-shot or few-shot settings: given a question, the model is asked to select one answer (A/B/C/D). Results are reported as the percentage of correct answers per task and as a weighted average across tasks.
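The evaluation procedure described above can be sketched as follows. This is a minimal illustration, not the official evaluation code: the prompt layout, the function names `format_prompt` and `score`, and the choice to weight the overall average by the number of questions per task are all assumptions made for the example.

```python
from collections import defaultdict

CHOICES = "ABCD"


def format_prompt(question, options, examples=()):
    """Build a zero-shot prompt, or a few-shot prompt when examples are given.

    `examples` is a sequence of (question, options, answer_letter) tuples.
    The layout is an assumed one, chosen only for illustration.
    """
    def block(q, opts, answer=""):
        lines = [q] + [f"{label}. {opt}" for label, opt in zip(CHOICES, opts)]
        # A solved example ends with its answer letter; the final
        # question ends with a bare "Answer:" for the model to complete.
        lines.append(f"Answer: {answer}".rstrip())
        return "\n".join(lines)

    shots = [block(q, opts, ans) for q, opts, ans in examples]
    return "\n\n".join(shots + [block(question, options)])


def score(records):
    """Compute per-task accuracy (%) and a question-weighted average (%).

    `records` is an iterable of (task, predicted_letter, gold_letter).
    Weighting by question count is an assumption of this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, pred, gold in records:
        total[task] += 1
        correct[task] += int(pred == gold)
    per_task = {t: 100.0 * correct[t] / total[t] for t in total}
    weighted = 100.0 * sum(correct.values()) / sum(total.values())
    return per_task, weighted
```

With this weighting, a task with more questions contributes proportionally more to the overall score than a small task, which is why the overall figure can differ from a plain mean of the per-task percentages.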
Before MMLU, there was no comprehensive benchmark spanning a wide spectrum of academic and professional domains, which made it impossible to reliably compare general world knowledge and problem-solving ability across large language models.
Modern models score above 85-90% on MMLU, so the benchmark no longer differentiates top-tier models.
MMLU questions may have appeared in model training data, inflating scores.
Hendrycks et al. introduce the 57-task benchmark; smaller GPT-3 models score near random chance (25%), while the largest GPT-3 model reaches about 44%.
Large models begin to clearly exceed human-level performance in some categories.
Benchmark saturation leads to the creation of MMLU-Pro and GPQA as successors.
The benchmark is hardware-agnostic – it evaluates model outputs on text questions without GPU/TPU requirements on the evaluation side.