The dataset extends MMLU by: (1) removing trivial and noisy questions and integrating questions from external sources; (2) expanding the answer options from 4 to 10 per question; (3) adding reasoning-focused, multi-step questions. Models are evaluated with direct answering and with chain-of-thought (CoT) prompting; results show CoT is more effective on MMLU-Pro than on the original MMLU.
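As a rough illustration of the 10-option format, the sketch below renders a question with lettered options and computes the random-guess baseline for a given number of options. The function names, the prompt wording, and the sample question are assumptions made for illustration; they are not taken from the benchmark's official evaluation code.

```python
import string


def format_question(question, options):
    """Render a multiple-choice question with lettered options (A, B, C, ...).

    The layout and the trailing CoT cue are illustrative assumptions,
    not the benchmark's official prompt template.
    """
    letters = string.ascii_uppercase[:len(options)]
    lines = [f"Question: {question}", "Options:"]
    lines += [f"{letter}. {text}" for letter, text in zip(letters, options)]
    lines.append("Answer: Let's think step by step.")
    return "\n".join(lines)


def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing over num_options choices."""
    return 1.0 / num_options


# Hypothetical example question with 10 options (A through J).
example_options = [str(n) for n in range(10, 20)]
prompt = format_question("What is 7 + 8?", example_options)
```

Going from 4 to 10 options lowers the random-guess baseline from 25% to 10%, so a lucky guess contributes less to the reported score.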
Frontier models saturate the original MMLU (scores above 85-90%), and its results are sensitive to prompt variations, which makes it difficult to distinguish capabilities among top models.
Prompts with 10 answer choices are longer than 4-choice prompts, which increases evaluation cost, especially for few-shot prompting with long examples.
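The cost effect can be sketched with simple arithmetic. The estimate below counts words in a few-shot prompt as a crude proxy for tokens; the function name and every number in the usage lines are hypothetical placeholders, not measurements from the benchmark.

```python
def estimate_prompt_words(num_shots, question_words, option_words,
                          num_options, rationale_words):
    """Rough word count of a few-shot prompt (a crude proxy for tokens).

    Each of the num_shots examples, plus the test question itself, carries
    the question text and all answer options; only the worked examples
    carry a rationale.
    """
    per_question = question_words + num_options * option_words
    return (num_shots + 1) * per_question + num_shots * rationale_words


# Hypothetical sizes: 5 shots, 50-word questions, 8-word options,
# 100-word rationales per worked example.
words_4 = estimate_prompt_words(5, 50, 8, 4, 100)
words_10 = estimate_prompt_words(5, 50, 8, 10, 100)
extra_words = words_10 - words_4
```

Under these assumed sizes the 10-option prompt is 288 words longer per query (1280 versus 992), and the gap grows with the number of shots because every example repeats the full option list.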
Wang et al. publish MMLU-Pro, an enhanced MMLU with 10 answer choices and reasoning-focused questions; model accuracy drops by 16-33% compared with the original MMLU.
Text-based benchmark independent of evaluation hardware.