Questions are written by domain experts and validated by other experts and by non-experts. For each question, accuracy is measured for domain experts, for non-experts with internet access, and for AI models. Format: multiple-choice with 4 options. The benchmark has three nested subsets: GPQA Diamond (198 questions, hardest), GPQA Main (448 questions), and GPQA Extended (546 questions, the full set).
Addresses the absence of a benchmark that evaluates deep specialist knowledge at PhD level and whose difficulty cannot be "bypassed" through information lookup, a property that is crucial for scalable oversight research.
Small dataset size can cause high variance in results between runs.
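As a rough illustration of the size effect, the sketch below computes the binomial standard error of an accuracy score for a given number of questions. It assumes questions are independent and covers only the sampling uncertainty from the finite question set; variance from decoding randomness between runs would come on top of it. The function names are hypothetical, and the 39% accuracy is used only as an example input.

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimated on n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def confidence_interval_95(accuracy: float, n_questions: int) -> tuple[float, float]:
    """Normal-approximation 95% interval, clipped to the valid range [0, 1]."""
    half_width = 1.96 * accuracy_standard_error(accuracy, n_questions)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


if __name__ == "__main__":
    # 198 and 448 are the Diamond and Main subset sizes; 0.39 is an example score.
    for n_questions in (198, 448):
        low, high = confidence_interval_95(0.39, n_questions)
        print(f"n={n_questions}: 95% interval {low:.3f} to {high:.3f}")
```

Under these assumptions the interval is roughly plus or minus 4.5 percentage points at 448 questions and close to plus or minus 7 points at 198, so small score differences on the smaller subset should be read with caution.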
Scores on GPQA Diamond vs Extended differ substantially; reporting a score without specifying the subset is misleading.
Rein et al. introduce a main set of 448 PhD-level questions; the GPT-4 baseline reaches 39% accuracy, and non-experts with internet access reach 34%.
GPT-4o, Claude 3 Opus, and Gemini Ultra report GPQA Diamond scores as a frontier capabilities measure.
Text-based benchmark; results do not depend on the evaluation hardware.