
AGIEval

First benchmark grounded in real-world human qualification exams (college entrance, LSAT, SAT, math competitions, bar exams) rather than artificially constructed tasks, enabling evaluation of AI models in the context of tasks with genuine societal relevance.

Category
foundation model evaluation · AI vs human comparison · knowledge and reasoning testing · bilingual evaluation

The dataset contains questions from official exams, grouped by type: multiple-choice (MC), free-text, and math problems. Models are evaluated both zero-shot and few-shot. Results are compared against average human performance for each exam.
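
A minimal sketch of this protocol, assuming a hypothetical `query_model` callable, question dicts with `question`/`answer` keys, and a `human_avg` lookup table; the official harness differs in detail:

```python
# Hypothetical sketch of the AGIEval-style protocol: answer each
# multiple-choice question zero-shot or with solved in-context examples,
# then compare per-exam accuracy with the reported human average.
from typing import Callable, Optional

def evaluate_exam(questions: list[dict],
                  query_model: Callable[[str], str],
                  few_shot_examples: Optional[list[dict]] = None) -> float:
    """Accuracy on one exam; each dict has 'question' and 'answer' keys."""
    prefix = ""
    if few_shot_examples:  # few-shot: prepend solved examples to every prompt
        prefix = "\n\n".join(
            f"Q: {ex['question']}\nA: {ex['answer']}" for ex in few_shot_examples
        ) + "\n\n"
    correct = 0
    for q in questions:
        prediction = query_model(f"{prefix}Q: {q['question']}\nA:")
        correct += prediction.strip().upper()[:1] == q["answer"]  # "A".."E"
    return correct / len(questions)

# zero_shot = evaluate_exam(lsat_questions, query_model)
# few_shot  = evaluate_exam(lsat_questions, query_model, dev_examples)
# print(f"model {few_shot:.1%} vs human average {human_avg['LSAT']:.1%}")
```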

Artificial benchmarks do not reflect the difficulty of tasks AI models may encounter in real-world applications. AGIEval places evaluation in the context of human cognition and decisions by using exams designed to assess human competencies.

Common pitfalls

Chinese language in a subset of tasks
MEDIUM

A subset of tasks is in Chinese, which may skew results for models weaker in that language.

Report scores separately for the EN and ZH subsets.
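
A small sketch of that mitigation, assuming each task result carries a hypothetical `lang` tag ("en" or "zh"):

```python
# Sketch: aggregate accuracy per language subset instead of reporting one
# blended number. The per-result "lang" annotation is a hypothetical field.
from collections import defaultdict

def scores_by_language(results: list[dict]) -> dict[str, float]:
    """results: [{'lang': 'en' | 'zh', 'correct': bool}, ...]"""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["lang"]] += 1
        hits[r["lang"]] += int(r["correct"])
    return {lang: hits[lang] / totals[lang] for lang in totals}

# e.g. {'en': 0.62, 'zh': 0.41} -- report both numbers, not just the mean
```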

GENESIS · Source paper

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Wanjun Zhong, Ruixiang Cui, Yiduo Guo et al. · arXiv, 2023
2023

AGIEval published (arXiv, April 2023)

breakthrough

Zhong et al. from Microsoft Research introduce the qualification exam benchmark. GPT-4 surpasses human average on SAT and LSAT.

Hardware agnostic · PRIMARY

Text-based benchmark independent of hardware.

Commonly used with

MMLU

MMLU (Massive Multitask Language Understanding) is a benchmark proposed by Hendrycks et al. in 2021, covering 57 domains ranging from elementary mathematics, US history, and computer science to law, medicine, and ethics. The dataset contains over 14,000 multiple-choice questions drawn from academic and professional exams. MMLU became the de facto standard for measuring the general knowledge and reasoning capabilities of LLMs; it revealed that early models barely exceeded random-chance accuracy (25%), while 2023-2024 models score above 85-90%, prompting harder successors (MMLU-Pro, GPQA).
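
For illustration, a hedged sketch of scoring a predictor on MMLU, assuming the commonly mirrored `cais/mmlu` copy on the Hugging Face Hub (config `"all"`; fields `question`, `choices`, `answer`); other mirrors may use a different layout:

```python
# Hedged sketch: score a predictor on the "cais/mmlu" mirror. The dataset
# name, config, and field names follow that mirror and are assumptions here.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")  # ~14k questions

def accuracy(predict) -> float:
    """predict(question: str, choices: list[str]) -> index of chosen option."""
    correct = sum(
        predict(row["question"], row["choices"]) == row["answer"]
        for row in mmlu
    )
    return correct / len(mmlu)

# With four options per question, uniform random guessing scores ~25%,
# the floor that early models barely cleared.
```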

GPQA

GPQA (Graduate-Level Google-Proof Q&A Benchmark) is a benchmark developed by Rein et al. (2023) containing 448 multiple-choice questions written by domain experts holding or pursuing PhDs in biology, physics, and chemistry. Questions are designed to be "Google-proof": highly skilled non-expert validators reached only 34% accuracy after 30 minutes of unrestricted web search, while domain experts reached 65% (74% after discounting clear retrospective mistakes). The strongest GPT-4 configuration achieved 39% in the original paper. GPQA is used as a measure of model capability in scalable-oversight settings, where AI may surpass the skills of its human supervisors.
