Evaluation

BIG-Bench

2022 · Active · Published: 6 May 2026 · Updated: 6 May 2026

Key innovation

First crowdsourced LLM evaluation benchmark assembled by 450+ authors from 132 institutions, containing 204+ tasks designed to exceed current model capabilities and measure emergent abilities at scale.

How it works

1. Task crowdsourcing: researchers from 132 institutions contribute tasks in a standardized format, either as JSON tasks (multiple choice or generative) or as programmatic tasks written in Python. Each task includes a description, examples, metrics, and ground truth.
2. Validation: the central team verifies that each task is hard for current models (GPT-2 and GPT-3 baselines) and has clear evaluation criteria.
3. Distribution: tasks are published as a GitHub repository (Apache 2.0) with a library for running benchmarks via model APIs.
4. Evaluation: the model is prompted with each task (zero-shot or few-shot), and its output is scored against ground truth using the task's metrics (accuracy, ROUGE, BLEU, exact match).
5. Aggregation: results are published on the BIG-Bench leaderboard, broken down by task category and analyzed for emergence as a function of parameter scale.
6. BBH (BIG-Bench Hard): a 23-task subset on which earlier models fell short of the average human rater under direct prompting, and on which chain-of-thought (CoT) prompting yields clearly better results. It is the canonical reasoning suite.
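The evaluation step above can be sketched in a few lines. This is a minimal illustration, not the official harness: the task content, `build_prompt`, `toy_model`, and `evaluate` are hypothetical names, and the "Q:/A:" prompt template is an assumption. The first `num_shots` examples serve as demonstrations and the rest are scored by exact string match.

```python
import json

# Hypothetical task, loosely modeled on the JSON layout described above.
TASK_JSON = """
{
  "name": "toy_arithmetic",
  "description": "Add two small integers.",
  "metrics": ["exact_str_match"],
  "examples": [
    {"input": "1 + 1 =", "target": "2"},
    {"input": "2 + 3 =", "target": "5"},
    {"input": "4 + 4 =", "target": "8"},
    {"input": "7 + 2 =", "target": "9"}
  ]
}
"""

def build_prompt(shots, query):
    """Concatenate the solved demonstrations, then the unsolved query."""
    blocks = [f"Q: {ex['input']}\nA: {ex['target']}" for ex in shots]
    blocks.append(f"Q: {query['input']}\nA:")
    return "\n\n".join(blocks)

def toy_model(prompt):
    """Stand-in for a real model call: solves the final 'a + b =' question."""
    question = prompt.rsplit("Q: ", 1)[1].split("\nA:")[0]
    a, b = question.rstrip("= ").split("+")
    return str(int(a) + int(b))

def evaluate(task, model, num_shots=0):
    """Exact-match accuracy over the examples not used as demonstrations."""
    examples = task["examples"]
    shots, queries = examples[:num_shots], examples[num_shots:]
    correct = 0
    for ex in queries:
        output = model(build_prompt(shots, ex)).strip()
        correct += int(output == ex["target"])
    return correct / len(queries)

task = json.loads(TASK_JSON)
zero_shot = evaluate(task, toy_model, num_shots=0)
two_shot = evaluate(task, toy_model, num_shots=2)
print(f"zero-shot: {zero_shot:.2f}, two-shot: {two_shot:.2f}")
```

Swapping `toy_model` for a function that calls a real model is the only change needed to reuse the loop. Other metrics would replace the equality check.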

Problem solved

Pre-2022 LLM benchmarks (GLUE, SuperGLUE) saturated quickly with larger models and did not test a broad spectrum of capabilities. The community lacked a benchmark sufficiently hard, diverse, and open to track progress across multiple model generations. BIG-Bench addressed this by crowdsourcing 200+ tasks specifically chosen as hard, spanning domains from mathematics to theory of mind.

Components

Task suite · Main evaluation task collection

204+ tasks in a standardized JSON format, each with metadata (author, category, metrics, prompt template, ground truth).
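As a rough sketch of what checking such a task file might involve, the validator below tests for a handful of fields. The required-field set and the `input` / `target` / `target_scores` example keys are assumptions made for illustration. They are not the authoritative schema, which lives in the official repository.

```python
# Assumed minimal schema for illustration; the real schema has more fields.
REQUIRED_FIELDS = {"name", "description", "metrics", "examples"}

def validate_task(task):
    """Return a list of problems found in a task dict (empty means it passed)."""
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - task.keys())]
    for i, example in enumerate(task.get("examples", [])):
        if "input" not in example:
            problems.append(f"example {i}: missing input")
        # Generative examples carry a target string; multiple-choice examples
        # carry a mapping from each candidate answer to its score.
        if "target" not in example and "target_scores" not in example:
            problems.append(f"example {i}: missing target or target_scores")
    return problems

good_task = {
    "name": "toy_task",
    "description": "Pick the larger number.",
    "metrics": ["multiple_choice_grade"],
    "examples": [{"input": "3 or 5?", "target_scores": {"3": 0, "5": 1}}],
}
bad_task = {"name": "broken", "examples": [{"input": "no answer given"}]}

print(validate_task(good_task))
print(validate_task(bad_task))
```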

BIG-Bench Hard (BBH) · Curated subset for reasoning evaluation

23 tasks on which earlier language models did not outperform the average human rater with standard (answer-only) prompting; CoT prompting significantly improves performance. The canonical reasoning test (Suzgun et al. 2022).
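The difference between the two prompting modes can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, the templates are simplified (real BBH prompts also include few-shot exemplars), and the extraction rule assumes the reasoning ends with a sentence of the form "So the answer is X."

```python
import re

def direct_prompt(question):
    """Answer-only prompting: the model is expected to reply with the answer."""
    return f"Q: {question}\nA:"

def cot_prompt(question):
    """Chain-of-thought prompting: invite intermediate reasoning first."""
    return f"Q: {question}\nA: Let's think step by step."

def extract_answer(completion):
    """Take the text after 'answer is' on the final line; otherwise the whole reply."""
    text = completion.strip()
    match = re.search(r"answer is\s*(.+?)\.?\s*$", text, flags=re.IGNORECASE)
    return match.group(1).strip() if match else text

cot_reply = "not True is False. False and True is False. So the answer is False."
print(extract_answer(cot_reply))
print(extract_answer("(B)"))
```

Both modes are then scored identically, by comparing the extracted string with the ground-truth target.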

Evaluation harness · Library for running evaluations

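Python framework built around a model interface for text generation and conditional log-probability scoring. It ships HuggingFace model wrappers, and other model APIs (such as OpenAI or Anthropic) can be wrapped by implementing the same interface. Supports multiple choice, generative scoring, and programmatic evaluation.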
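One way to picture such a harness is as a small adapter interface that any backend implements. The sketch below is hypothetical: the method names echo common harness conventions, but the signatures are simplified assumptions, and `KeywordModel` is a toy stand-in rather than a real backend. It shows multiple-choice scoring by picking the choice with the highest conditional log-probability.

```python
import math
from abc import ABC, abstractmethod
from typing import Dict, List

class Model(ABC):
    """Hypothetical adapter interface; real harness signatures may differ."""

    @abstractmethod
    def generate_text(self, prompt: str) -> str:
        ...

    @abstractmethod
    def cond_log_prob(self, prompt: str, choices: List[str]) -> List[float]:
        ...

class KeywordModel(Model):
    """Toy backend: favors choices whose words also occur in the prompt."""

    def generate_text(self, prompt: str) -> str:
        return prompt.split()[-1]

    def cond_log_prob(self, prompt: str, choices: List[str]) -> List[float]:
        words = set(prompt.lower().split())
        raw = [1.0 + sum(w in words for w in c.lower().split()) for c in choices]
        total = sum(raw)
        return [math.log(r / total) for r in raw]  # normalized over the choices

def multiple_choice_score(model: Model, example: Dict) -> float:
    """Return the ground-truth score of the choice the model ranks highest."""
    choices = list(example["target_scores"])
    log_probs = model.cond_log_prob(example["input"], choices)
    best = max(range(len(choices)), key=log_probs.__getitem__)
    return example["target_scores"][choices[best]]

example = {"input": "Pick the color named here: blue",
           "target_scores": {"blue": 1, "green": 0}}
score = multiple_choice_score(KeywordModel(), example)
print(score)
```

A hosted API or a local checkpoint would each get their own `Model` subclass, leaving the scoring code unchanged.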

Lite subset (BIG-Bench Lite) · Cost-efficient subset for fast evaluation

24 tasks optimized for low evaluation cost (small example counts) while preserving the diversity of full BIG-Bench.

Implementation

Reference implementations

BIG-bench (official GitHub repository)

Python · Google + BIG-bench collaboration

Official

BIG-Bench Hard (BBH) — prompts repository

Python · Mirac Suzgun et al.

Official

lm-evaluation-harness (BIG-Bench tasks integration)

Python · EleutherAI

HELM — Holistic Evaluation of Language Models (Stanford CRFM)

Python · Stanford CRFM

Implementation pitfalls

Data contamination: BIG-Bench tasks in pretraining corpus · High

The BIG-Bench repository has been publicly available on GitHub since 2022. Models trained after 2022 may have tasks in their pretraining corpus, artificially inflating scores.

Fix: Run a decontamination pipeline on the pretraining corpus (13-gram overlap matching, in the style of Brown et al.) and evaluate on fresh, held-out tasks created after the model's training cutoff.
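A 13-gram overlap check can be sketched as below. This is a simplified illustration: it assumes whitespace tokenization and lowercasing, and the function names are hypothetical. Production pipelines typically normalize punctuation and use hashed or approximate indexes to handle corpus scale.

```python
def ngrams(text, n=13):
    """Set of all n-token windows in the text (whitespace tokens, lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(benchmark_texts, n=13):
    """Union of the n-grams from every benchmark example."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index

def is_contaminated(document, index, n=13):
    """True if the document shares at least one n-gram with the benchmark."""
    return not ngrams(document, n).isdisjoint(index)

benchmark = ["alice has three apples and gives one to bob so how many apples does alice have now"]
index = build_index(benchmark)

leaked = "Forum post: " + benchmark[0] + " Answer: two."
clean = "A recipe for apple pie that happens to mention alice and bob in passing once."

print(is_contaminated(leaked, index))
print(is_contaminated(clean, index))
```

Documents flagged this way are dropped from the pretraining corpus, or the matching benchmark examples are excluded from the reported score.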

Heterogeneous metrics: aggregation difficulty · Medium

Each task has its own metric (accuracy, ROUGE, BLEU, custom). An arithmetic mean across tasks is misleading because some metrics range 0–1 and others 0–100.

Fix: Apply metric normalization (calibrated score, per-task z-score) or report results per category or subset.
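One possible normalization rescales each task so that a baseline score maps to 0 and the maximum maps to 100 before averaging. The sketch below assumes each task declares such a low and high score. The task names and numbers are made up for illustration.

```python
from statistics import mean

def normalize(raw, low, high):
    """Rescale so the task's baseline (low) maps to 0 and its maximum (high) to 100."""
    return 100.0 * (raw - low) / (high - low)

# (task name, raw score, low = chance or baseline, high = best possible)
results = [
    ("four_way_multiple_choice", 0.40, 0.25, 1.0),   # accuracy on a 0-1 scale
    ("bleu_translation", 30.0, 0.0, 100.0),          # BLEU on a 0-100 scale
]

naive_mean = mean(raw for _, raw, _, _ in results)
normalized_mean = mean(normalize(raw, low, high) for _, raw, low, high in results)

print(f"naive mean of raw scores: {naive_mean:.2f}")
print(f"mean of normalized scores: {normalized_mean:.2f}")
```

The naive mean is dominated by whichever task uses the larger scale, while the normalized mean weights both tasks equally and credits the multiple-choice task only for accuracy above chance.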

BBH saturation on frontier models · High

Frontier models such as GPT-5, Gemini 3, and Claude Opus 4 reach 95–98% average accuracy on BBH, so the benchmark can no longer discriminate between top-tier models.

Fix: Use BBH only as a sanity check. To differentiate frontier models, use GPQA, MMLU-Pro, FrontierMath, or ARC-AGI.

Critique of emergence as a metric artifact · Medium

Schaeffer et al. (2023) showed that some "emergence jumps" on BIG-Bench arise from discontinuous metrics such as accuracy; under continuous metrics such as cross-entropy, performance improves smoothly with scale.

Fix: Report both accuracy and a continuous metric (negative log-likelihood). Be cautious when claiming phase transitions.
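The artifact can be reproduced with a toy calculation. The sketch below is purely illustrative: it assumes per-token accuracy follows a sigmoid in log model size and that token errors are independent, so exact match on a 10-token answer is the per-token accuracy raised to the 10th power. None of the numbers are measured results.

```python
import math

def per_token_accuracy(log10_params, midpoint=9.0):
    """Assumed smooth trend: a sigmoid in log10 of the parameter count."""
    return 1.0 / (1.0 + math.exp(-(log10_params - midpoint)))

def scaling_table(answer_len=10, sizes=(7, 8, 9, 10, 11, 12)):
    """Per-token negative log-likelihood and exact match for each model size."""
    rows = []
    for log_n in sizes:
        p = per_token_accuracy(log_n)
        rows.append({
            "log10_params": log_n,
            "nll_per_token": -math.log(p),    # continuous metric
            "exact_match": p ** answer_len,   # every token must be correct
        })
    return rows

for row in scaling_table():
    print(f"10^{row['log10_params']:>2} params  "
          f"NLL/token={row['nll_per_token']:.3f}  "
          f"exact match={row['exact_match']:.4f}")
```

In this toy setting the negative log-likelihood falls steadily at every size, while exact match stays near zero for the smaller models and then climbs steeply, which on a plot can look like a sudden jump even though the underlying capability changes smoothly.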