Evaluation

GPQA

2023 · Active · Published

Key innovation

First "Google-proof" PhD-level benchmark: highly skilled non-expert validators reach only 34% accuracy despite spending more than 30 minutes on average with unrestricted web access. It tests whether AI models hold deep specialist knowledge that cannot be retrieved by a simple web lookup.

Category

Evaluation

Abstraction level

Pattern

Operation level

Evaluation (runtime)

Use cases

frontier AI evaluation · scalable oversight research · specialist knowledge testing · safety evaluation

How it works

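Questions are written by domain experts in biology, physics, and chemistry, then validated by other experts and by non-experts. For each question, accuracy is measured for domain experts, for non-experts with internet access, and for AI models. Format: multiple choice with 4 options, so random guessing scores 25%. The benchmark is released as three sets: GPQA Extended (546 questions, the full set), GPQA main (448 questions), and GPQA Diamond (198 questions, the hardest and highest-quality subset).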
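The four-option format can be illustrated with a minimal sketch. The function and class names here (format_question, FormattedQuestion, accuracy) are hypothetical and not part of any official GPQA harness. The sketch assumes each record supplies one correct answer and three incorrect ones, and it shuffles the options so the correct answer does not always sit at the same letter.

```python
import random
from dataclasses import dataclass

LETTERS = "ABCD"


@dataclass
class FormattedQuestion:
    prompt: str          # question text followed by four lettered options
    correct_letter: str  # letter that holds the correct answer after shuffling


def format_question(question, correct, incorrect, rng):
    """Shuffle the four options and record which letter is correct."""
    if len(incorrect) != 3:
        raise ValueError("expected exactly three incorrect options")
    options = [correct] + list(incorrect)
    rng.shuffle(options)
    lines = [question] + [f"({letter}) {opt}" for letter, opt in zip(LETTERS, options)]
    return FormattedQuestion("\n".join(lines), LETTERS[options.index(correct)])


def accuracy(predicted, gold):
    """Fraction of predicted letters that match the gold letters."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


if __name__ == "__main__":
    # Placeholder content, not a real GPQA question.
    fq = format_question("Placeholder question?", "right", ["wrong 1", "wrong 2", "wrong 3"],
                         random.Random(0))
    print(fq.prompt)
    print("gold:", fq.correct_letter)
```

Passing an explicit random.Random instance keeps the shuffle reproducible, which helps when comparing runs on so small a dataset.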

Problem solved

There was no benchmark for PhD-level specialist knowledge in which AI models cannot sidestep the difficulty by looking up the answer. Such a benchmark is crucial for scalable oversight research.

Implementation

Reference implementations

GPQA – Hugging Face Dataset

Implementation pitfalls

Small dataset size (448 questions) · Medium

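With only 448 questions (198 in GPQA Diamond), the sampling error on a single accuracy score is several percentage points, so results can vary noticeably between runs.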

Fix: Run multiple trials and report confidence intervals.
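As a rough illustration of this fix, the sketch below computes a Wilson score interval for a single run and a mean with standard error across repeated runs. The helper names (wilson_interval, mean_and_stderr) are hypothetical, and the choice of the Wilson interval is an assumption made here, not something the GPQA paper prescribes.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a binomial accuracy."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def mean_and_stderr(scores):
    """Mean accuracy and its standard error across repeated trials."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two trials")
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return mean, math.sqrt(variance / n)


if __name__ == "__main__":
    low, high = wilson_interval(224, 448)
    print(f"single run: 50.0% (95% CI {low:.1%} to {high:.1%})")
```

At 50% accuracy on 448 questions the interval spans roughly 4.6 percentage points on either side, so a gap of two or three points between models is not conclusive on its own.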

Critical subset distinction · High

Scores on GPQA Diamond vs Extended differ substantially; reporting a score without specifying the subset is misleading.

Fix: Always report the subset name alongside the score.
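One way to make this fix hard to forget is to store the subset in the result record itself. The sketch below is an assumption-laden illustration: GPQAResult and SUBSET_SIZES are hypothetical names, and the subset sizes (546, 448, 198) are the question counts given in the GPQA paper.

```python
from dataclasses import dataclass

# Question counts per subset, as reported in the GPQA paper.
SUBSET_SIZES = {"extended": 546, "main": 448, "diamond": 198}


@dataclass(frozen=True)
class GPQAResult:
    model: str
    subset: str   # must be one of the keys in SUBSET_SIZES
    correct: int  # number of correctly answered questions

    def __post_init__(self):
        if self.subset not in SUBSET_SIZES:
            raise ValueError(f"unknown subset {self.subset!r}")
        if not 0 <= self.correct <= SUBSET_SIZES[self.subset]:
            raise ValueError("correct count out of range for this subset")

    @property
    def accuracy(self):
        return self.correct / SUBSET_SIZES[self.subset]

    def report(self):
        """Render the score with the subset name always attached."""
        total = SUBSET_SIZES[self.subset]
        name = self.subset.capitalize()
        return f"{self.model}: {self.accuracy:.1%} on GPQA {name} ({self.correct}/{total})"


if __name__ == "__main__":
    # "model-x" is a placeholder, not a real reported result.
    print(GPQAResult("model-x", "diamond", 99).report())
```

Because the subset is a required field and is validated on construction, a score cannot be created or printed without stating which set it refers to.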

Evolution

Original paper · 2023 · arXiv 2023 · David Rein

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman

2023

GPQA published (arXiv, November 2023)

Inflection point

Rein et al. introduce 448 PhD-level questions; the strongest GPT-4 baseline achieves 39%, non-experts 34%, and domain experts 65%.

2024

GPQA Diamond becomes standard frontier AI benchmark

GPT-4o, Claude 3 Opus, and Gemini Ultra report GPQA Diamond scores as a frontier capabilities measure.

Sources

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

GPQA dataset – Hugging Face