
AGIEval

First benchmark grounded in real-world human qualification exams (college entrance, LSAT, SAT, math competitions, bar exams) rather than artificially constructed tasks, enabling evaluation of AI models in the context of tasks with genuine societal relevance.

Category
foundation model evaluation · AI vs human comparison · knowledge and reasoning testing · bilingual evaluation

The dataset contains questions from official exams, grouped by type: multiple-choice (MC), free-text, and math problems. Models are evaluated both zero-shot and few-shot. Results are compared against average human performance for each exam.
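
A minimal sketch of this protocol, assuming a hypothetical `query_model` callable, question dicts with `question`/`answer` keys, and a `human_avg` lookup table; the official harness differs in detail:

```python
# Hypothetical sketch of the AGIEval-style protocol: answer each
# multiple-choice question zero-shot or with solved in-context examples,
# then compare per-exam accuracy with the reported human average.
from typing import Callable, Optional

def evaluate_exam(questions: list[dict],
                  query_model: Callable[[str], str],
                  few_shot_examples: Optional[list[dict]] = None) -> float:
    """Accuracy on one exam; each dict has 'question' and 'answer' keys."""
    prefix = ""
    if few_shot_examples:  # few-shot: prepend solved examples to every prompt
        prefix = "\n\n".join(
            f"Q: {ex['question']}\nA: {ex['answer']}" for ex in few_shot_examples
        ) + "\n\n"
    correct = 0
    for q in questions:
        prediction = query_model(f"{prefix}Q: {q['question']}\nA:")
        correct += prediction.strip().upper()[:1] == q["answer"]  # "A".."E"
    return correct / len(questions)

# zero_shot = evaluate_exam(lsat_questions, query_model)
# few_shot  = evaluate_exam(lsat_questions, query_model, dev_examples)
# print(f"model {few_shot:.1%} vs human average {human_avg['LSAT']:.1%}")
```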

Artificial benchmarks do not reflect the difficulty of tasks AI models may encounter in real-world applications. AGIEval places evaluation in the context of human cognition and decisions by using exams designed to assess human competencies.

Common pitfalls

Chinese language in a subset of tasks
MEDIUM

A subset of tasks is in Chinese, which may skew results for models weaker in that language.

Report scores separately for the EN and ZH subsets.
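
A small sketch of that mitigation, assuming each task result carries a hypothetical `lang` tag ("en" or "zh"):

```python
# Sketch: aggregate accuracy per language subset instead of reporting one
# blended number. The per-result "lang" annotation is a hypothetical field.
from collections import defaultdict

def scores_by_language(results: list[dict]) -> dict[str, float]:
    """results: [{'lang': 'en' | 'zh', 'correct': bool}, ...]"""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["lang"]] += 1
        hits[r["lang"]] += int(r["correct"])
    return {lang: hits[lang] / totals[lang] for lang in totals}

# e.g. {'en': 0.62, 'zh': 0.41} -- report both numbers, not just the mean
```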

GENESIS · Source paper

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Wanjun Zhong, Ruixiang Cui, Yiduo Guo et al. · arXiv, 2023
2023

AGIEval published (arXiv, April 2023)

breakthrough

Zhong et al. from Microsoft Research introduce the qualification exam benchmark. GPT-4 surpasses human average on SAT and LSAT.

Hardware agnostic · PRIMARY

Text-based benchmark independent of hardware.

Commonly used with

MMLU

MMLU (Massive Multitask Language Understanding) is a benchmark proposed by Hendrycks et al. in 2021, covering 57 domains ranging from elementary mathematics, US history, and computer science to law, medicine, and ethics. The dataset contains over 14,000 multiple-choice questions drawn from academic and professional exams. MMLU became the de facto standard for measuring the general knowledge and reasoning capabilities of LLMs; it revealed that early models barely exceeded random-chance accuracy (25%), while 2023-2024 models score above 85-90%, prompting harder successors (MMLU-Pro, GPQA).
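
For illustration, a hedged sketch of scoring a predictor on MMLU, assuming the commonly mirrored `cais/mmlu` copy on the Hugging Face Hub (config `"all"`; fields `question`, `choices`, `answer`); other mirrors may use a different layout:

```python
# Hedged sketch: score a predictor on the "cais/mmlu" mirror. The dataset
# name, config, and field names follow that mirror and are assumptions here.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")  # ~14k questions

def accuracy(predict) -> float:
    """predict(question: str, choices: list[str]) -> index of chosen option."""
    correct = sum(
        predict(row["question"], row["choices"]) == row["answer"]
        for row in mmlu
    )
    return correct / len(mmlu)

# With four options per question, uniform random guessing scores ~25%,
# the floor that early models barely cleared.
```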

GPQA

GPQA (Graduate-Level Google-Proof Q&A Benchmark) is a benchmark developed by Rein et al. (2023) containing 448 multiple-choice questions written by domain experts holding or pursuing PhDs in biology, physics, and chemistry. Questions are designed to be "Google-proof": highly skilled non-expert validators reached only 34% accuracy after 30 minutes of unrestricted web search, while domain experts reached 65% (74% after discounting clear retrospective mistakes). The strongest GPT-4 configuration achieved 39% in the original paper. GPQA is used as a measure of model capability in scalable-oversight settings, where AI may surpass the skills of its human supervisors.
