Graduate-Level Google-Proof Q&A Benchmark
First "Google-proof" PhD-level benchmark where even highly skilled non-expert validators only reach 34% accuracy after 30 minutes of unrestricted web search, testing deep specialist knowledge of AI models that cannot be found by simple web lookup.
Questions are written by domain experts and validated by both other experts and skilled non-experts. For each question, accuracy was measured for domain experts, for non-experts with internet access, and for AI models. Format: multiple-choice with 4 options. The benchmark has three subsets: GPQA Extended (546 questions), GPQA Main (448 questions), and GPQA Diamond (198 questions, the hardest).
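As a quick illustration of the 4-option format, here is a minimal scoring sketch. The answers below are simulated stand-ins (the real dataset is gated on Hugging Face); the `score` helper is ours, not from the GPQA paper:

```python
import random

# Minimal sketch of GPQA-style 4-option multiple-choice scoring.
# Option indices are simulated; the real dataset provides one
# correct option per question.

def score(predictions, answers):
    """Fraction of questions whose predicted option index is correct."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

rng = random.Random(0)
n = 448  # size of the GPQA Main set
answers = [rng.randrange(4) for _ in range(n)]
guesses = [rng.randrange(4) for _ in range(n)]

# Random guessing over 4 options hovers around the 25% chance floor.
print(f"random baseline: {score(guesses, answers):.3f}")
```

Any real evaluation would replace `guesses` with a model's chosen option per question; the accuracy computation itself is unchanged.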
Before GPQA, no benchmark evaluated deep PhD-level specialist knowledge in a way that prevents AI models from "bypassing" the difficulty through information lookup, a property crucial for scalable oversight research.
Common pitfalls
Small dataset size (448 questions) · MEDIUM
Small dataset size can cause high variance in results between runs.
Run multiple trials and report confidence intervals.
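The mitigation above can be sketched with a Wilson score interval, a standard choice for a binomial proportion such as benchmark accuracy (the function name is ours, not from the GPQA paper):

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for accuracy = correct / n."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# With only 198 GPQA Diamond questions, a 50% score carries a wide
# uncertainty band of roughly +/- 7 percentage points.
lo, hi = wilson_interval(99, 198)
print(f"95% CI: {lo:.3f}-{hi:.3f}")
```

The same function applied to the 448-question Main set gives a narrower but still non-trivial interval, which is why single-run score differences of a few points are rarely meaningful on this benchmark.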
Critical subset distinction · HIGH
Scores on GPQA Diamond vs Extended differ substantially; reporting a score without specifying the subset is misleading.
Always report the subset name alongside the score.
Reference implementations
GENESIS · Source paper
GPQA: A Graduate-Level Google-Proof Q&A Benchmark published (arXiv, November 2023)
Breakthrough: Rein et al. introduce 448 PhD-level questions; GPT-4 achieves 39% accuracy, non-experts 34%.
GPQA Diamond becomes standard frontier AI benchmark
The model cards for GPT-4o, Claude 3 Opus, and Gemini Ultra all report GPQA Diamond scores as a frontier-capabilities measure.
Text-based benchmark independent of evaluation hardware.
Commonly used with
MMLU
MMLU (Massive Multitask Language Understanding) is a benchmark proposed by Hendrycks et al. in 2021, covering 57 domains ranging from elementary mathematics, US history, and computer science to law, medicine, and ethics. The dataset contains over 14,000 multiple-choice questions drawn from academic and professional exams. MMLU became the de facto standard for measuring general knowledge and reasoning capabilities of LLMs; it revealed that early models barely exceeded random chance accuracy (25%), while 2023-2024 models achieve above 85-90%, leading to harder successors (MMLU-Pro, GPQA).
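MMLU's headline number is typically the macro average of per-subject accuracies, so the 57 subjects weigh equally regardless of size. The sketch below contrasts this with a micro (per-question) average; the accuracies and question counts are illustrative, not real results:

```python
# Illustrative per-subject accuracies (invented numbers) and question
# counts for three MMLU-style subjects.
accuracy = {"elementary_mathematics": 0.62,
            "high_school_us_history": 0.81,
            "professional_law": 0.55}
counts = {"elementary_mathematics": 378,
          "high_school_us_history": 204,
          "professional_law": 1534}

# Macro average: every subject counts equally.
macro = sum(accuracy.values()) / len(accuracy)

# Micro average: every question counts equally, so large subjects dominate.
micro = (sum(accuracy[s] * counts[s] for s in accuracy)
         / sum(counts.values()))

print(f"macro={macro:.3f} micro={micro:.3f}")
```

When large subjects are harder (as `professional_law` is here), the micro average lands below the macro one, which is one reason reported MMLU scores can differ slightly between evaluation harnesses.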
MMLU-Pro
MMLU-Pro is an enhanced benchmark introduced by Wang et al. (2024), developed in response to saturation of the original MMLU by modern models. Key changes from MMLU: (1) answer choices expanded from 4 to 10, reducing random guessing effectiveness; (2) trivial and noisy questions removed; (3) reasoning-focused multi-step questions added, where CoT outperforms direct answering (unlike original MMLU). MMLU-Pro causes a 16-33% drop in model accuracy compared to MMLU and reduces score sensitivity to prompt variations from 4-5% to 2%. Accepted at NeurIPS 2024 (Spotlight).
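The effect of expanding from 4 to 10 answer choices can be made concrete by normalizing accuracy against the random-guessing floor (a common rescaling; the helper below is ours, not from the MMLU-Pro paper):

```python
def above_chance(accuracy, n_options):
    """Rescale accuracy so 0.0 = random guessing and 1.0 = perfect."""
    chance = 1 / n_options
    return (accuracy - chance) / (1 - chance)

# The same raw 50% accuracy means more with 10 options than with 4,
# because the guessing floor drops from 25% to 10%.
print(f"4 options:  {above_chance(0.50, 4):.3f}")
print(f"10 options: {above_chance(0.50, 10):.3f}")
```

This is one way to see why MMLU-Pro scores are not directly comparable to MMLU scores: part of the reported drop comes from the lower chance floor, not only from harder questions.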
| Title | Publisher | Type |
|---|---|---|
| GPQA: A Graduate-Level Google-Proof Q&A Benchmark | arXiv | scientific article |
| GPQA dataset on Hugging Face | Hugging Face | repository |