1. Task crowdsourcing: researchers from 132 institutions contribute tasks in a standardized JSON format (multiple choice, generative, programmatic). Each task includes a description, examples, metrics, and ground truth.
2. Validation: the central team verifies that the task is hard for current models (GPT-2, GPT-3 baseline) and has clear evaluation criteria.
3. Distribution: tasks are published as a GitHub repository (Apache 2.0) with a library for running benchmarks via model APIs.
4. Evaluation: the model is prompted with each task (zero-shot or few-shot); output is scored against ground truth by task metrics (accuracy, ROUGE, BLEU, exact match).
5. Aggregation: results are published on the BIG-Bench leaderboard, broken down by task category and analyzed for emergence as a function of parameter scale.
6. BBH (BIG-Bench Hard): a 23-task subset where standard CoT prompting yields clearly better results than direct prompting, the canonical reasoning suite.
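The evaluation step (step 4) can be sketched in a few lines. This is a minimal illustration, not the BIG-Bench library's API: `TASK`, `toy_model`, and `evaluate_task` are hypothetical names, and the `input`/`target` fields are assumed for the example.

```python
from typing import Callable

# Hypothetical toy task; the field names are assumptions for illustration.
TASK = {
    "name": "toy_arithmetic",
    "description": "Add two small integers.",
    "metrics": ["exact_match"],
    "examples": [
        {"input": "2 + 2 =", "target": "4"},
        {"input": "3 + 5 =", "target": "8"},
        {"input": "7 + 6 =", "target": "13"},
    ],
}


def toy_model(prompt: str) -> str:
    """Stand-in for a model API call: parses 'a + b =' and returns the sum."""
    left, right = prompt.rstrip("= ").split("+")
    return str(int(left) + int(right))


def evaluate_task(task: dict, model: Callable[[str], str]) -> float:
    """Zero-shot evaluation: prompt with each input, score by exact match."""
    examples = task["examples"]
    correct = sum(
        model(ex["input"]).strip() == ex["target"].strip() for ex in examples
    )
    return correct / len(examples)


score = evaluate_task(TASK, toy_model)
```

In practice `toy_model` would be replaced by a call to whichever model API is being benchmarked; the scoring loop stays the same.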
Pre-2022 LLM benchmarks (GLUE, SuperGLUE) saturated quickly with larger models and did not test a broad spectrum of capabilities. The community lacked a benchmark sufficiently hard, diverse, and open to track progress across multiple model generations. BIG-Bench addressed this by crowdsourcing 200+ tasks specifically chosen as hard, spanning domains from mathematics to theory of mind.
204+ tasks in a standardized JSON format, each with metadata (author, category, metrics, prompt template, ground truth).
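As a rough sketch of what consuming such a task file might look like, the snippet below parses a JSON task and checks for required metadata. The field names (`name`, `description`, `keywords`, `metrics`, `examples`, `target_scores`) and the `load_task` helper are assumptions for illustration, not a verified copy of the real schema.

```python
import json

# Hypothetical task file contents; field names are illustrative assumptions.
RAW_TASK = """
{
  "name": "toy_negation",
  "description": "Decide whether a sentence is negated.",
  "keywords": ["logical reasoning"],
  "metrics": ["multiple_choice"],
  "examples": [
    {"input": "It is not raining.", "target_scores": {"negated": 1, "plain": 0}},
    {"input": "It is raining.", "target_scores": {"negated": 0, "plain": 1}}
  ]
}
"""

REQUIRED_FIELDS = ("name", "description", "metrics", "examples")


def load_task(raw: str) -> dict:
    """Parse a task definition and check that the required fields are present."""
    task = json.loads(raw)
    missing = [field for field in REQUIRED_FIELDS if field not in task]
    if missing:
        raise ValueError(f"task is missing fields: {missing}")
    if not task["examples"]:
        raise ValueError("task has no examples")
    return task


task = load_task(RAW_TASK)
```

Validating up front keeps malformed contributions from failing halfway through an expensive evaluation run.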
23 tasks where standard prompting underperforms humans; CoT prompting significantly improves performance. The canonical reasoning test (Suzgun et al. 2022).
Python framework integrating with model APIs (OpenAI, Anthropic, HuggingFace), supporting multiple choice, generative scoring, and programmatic evaluation.
24 tasks optimized for low evaluation cost (small example counts) while preserving the diversity of full BIG-Bench.
The BIG-Bench repository has been publicly available on GitHub since 2022. Models trained after 2022 may have tasks in their pretraining corpus, artificially inflating scores.
Each task has its own metric (accuracy, ROUGE, BLEU, custom). A plain arithmetic mean is misleading: some tasks are scored on a 0–1 scale, others on 0–100.
GPT-5, Gemini 3, Claude Opus 4 reach 95–98% average accuracy on BBH. The benchmark loses its ability to discriminate top-tier models.
Schaeffer et al. (2023) showed that some "emergence jumps" on BIG-Bench arise from discrete metrics (accuracy); under continuous metrics (cross-entropy), behavior is smooth.
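One common way around the scale mismatch is to rescale every score to a shared range before averaging. The sketch below assumes each task declares a lowest and highest meaningful score; the task names, numbers, and the choice of chance level as the lower bound are all made up for illustration.

```python
from statistics import mean

# Hypothetical raw results: (score, lowest meaningful score, highest score).
RAW_RESULTS = {
    "task_a_accuracy": (0.62, 0.0, 1.0),          # metric on a 0-1 scale
    "task_b_bleu": (31.0, 0.0, 100.0),            # metric on a 0-100 scale
    "task_c_multiple_choice": (0.55, 0.25, 1.0),  # 4 options: chance is 0.25
}


def normalize(score: float, low: float, high: float) -> float:
    """Rescale a raw score so that `low` maps to 0 and `high` maps to 100."""
    return 100.0 * (score - low) / (high - low)


# Naive mean mixes incompatible scales and is dominated by the 0-100 metric.
naive_mean = mean(score for score, _, _ in RAW_RESULTS.values())

# Normalized mean puts every task on the same 0-100 footing first.
normalized_mean = mean(normalize(*entry) for entry in RAW_RESULTS.values())
```

With these made-up numbers the naive mean is driven almost entirely by the BLEU task, while the normalized mean weights all three tasks equally.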
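The effect can be illustrated with a toy calculation. Assume, purely for illustration, that a model's per-token probability of producing the correct token improves steadily with scale, that the answer is 10 tokens long, and that tokens are independent. Exact match then requires all 10 tokens to be right, so it behaves like the per-token probability raised to the 10th power.

```python
import math

ANSWER_LENGTH = 10  # assumed number of tokens in the target answer

# Assumed per-token probability of the correct token at increasing scale.
per_token_accuracy = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]

# Discrete metric: the answer counts only if every token is correct.
exact_match = [p ** ANSWER_LENGTH for p in per_token_accuracy]

# Continuous metric: per-token cross-entropy in nats.
cross_entropy = [-math.log(p) for p in per_token_accuracy]

for p, em, ce in zip(per_token_accuracy, exact_match, cross_entropy):
    print(f"p={p:.2f}  exact_match={em:.3f}  cross_entropy={ce:.3f}")
```

Under these assumptions exact match stays close to zero for the first several steps and then rises quickly, while cross-entropy falls steadily throughout. The underlying improvement is the same in both columns; only the metric differs.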
Google announces an open call to crowdsource LLM evaluation tasks. Goal: hard, diverse, open tasks.
First public benchmark release as a GitHub repository. Evaluation of GPT-3, PaLM 540B, and several open-weight models.
Curated subset of tasks where CoT prompting yields a clear gain over direct prompting. BBH becomes the canonical reasoning test.
OpenAI reports that GPT-4 with CoT exceeds the human baseline on most BBH tasks, the first model generation to do so.
Stanford NLP shows that some emergence jumps on BIG-Bench disappear after switching metrics from accuracy to continuous ones (e.g. cross-entropy).
Claude 3.5, Gemini 1.5 Pro, GPT-4o reach 90%+ average accuracy on BBH; demand grows for harder benchmarks (GPQA, MMLU-Pro).
Reasoning models (o1, DeepSeek-R1, Gemini 2.5 Deep Think, Claude Opus 4 Thinking) use BBH as a standard reference alongside GPQA and AIME.
Choice: full (204+ tasks), BBH (23 reasoning tasks), Lite (24 cost-efficient tasks), or a custom category-based subset.
Direct prompting vs Chain-of-Thought. BBH shows significant differences: CoT improves results by 10–30 pp on most tasks.
Zero-shot, 1-shot, few-shot (3–8). Most BIG-Bench evaluations use zero-shot or 3-shot as the standard.
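The two choices above, prompting style and number of shots, can be combined in a single prompt builder. The sketch below is illustrative only: `DEMOS`, `build_prompt`, the `Q:`/`A:` layout, and the "Let's think step by step." cue are assumptions, not the exact templates used by any BIG-Bench task.

```python
from typing import Sequence

# Hypothetical demonstrations; "rationale" is used only for CoT prompts.
DEMOS = [
    {"input": "Is 7 odd?", "rationale": "7 leaves remainder 1 when divided by 2.", "target": "yes"},
    {"input": "Is 12 odd?", "rationale": "12 is divisible by 2.", "target": "no"},
    {"input": "Is 9 odd?", "rationale": "9 leaves remainder 1 when divided by 2.", "target": "yes"},
]


def build_prompt(question: str, demos: Sequence[dict], shots: int = 0, cot: bool = False) -> str:
    """Build a k-shot prompt; with cot=True each demo shows its rationale first."""
    blocks = []
    for demo in demos[:shots]:
        if cot:
            answer = f"{demo['rationale']} So the answer is {demo['target']}."
        else:
            answer = demo["target"]
        blocks.append(f"Q: {demo['input']}\nA: {answer}")
    # The final block leaves the answer open for the model to complete.
    suffix = "A: Let's think step by step." if cot else "A:"
    blocks.append(f"Q: {question}\n{suffix}")
    return "\n\n".join(blocks)


direct_prompt = build_prompt("Is 10 odd?", DEMOS, shots=0)
cot_prompt = build_prompt("Is 10 odd?", DEMOS, shots=3, cot=True)
```

Keeping the builder separate from scoring makes it easy to run the same task under direct and CoT prompting and compare the two scores.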
Per task: exact match, multiple choice accuracy, ROUGE, BLEU, BLEURT, programmatic check, custom.
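Two of the simpler metrics can be sketched directly. The function names and the `METRICS` registry below are hypothetical; text-overlap metrics such as ROUGE, BLEU, and BLEURT need dedicated libraries and are left out.

```python
def exact_match(prediction: str, target: str) -> float:
    """1.0 if the prediction equals the target after trimming whitespace."""
    return float(prediction.strip() == target.strip())


def multiple_choice_accuracy(option_scores: dict, target_scores: dict) -> float:
    """1.0 if the option the model scores highest is marked correct.

    `option_scores` maps each option to a model score (e.g. a log-probability);
    `target_scores` maps each option to 1 (correct) or 0 (incorrect).
    """
    chosen = max(option_scores, key=option_scores.get)
    return float(target_scores.get(chosen, 0) > 0)


# Hypothetical registry so a task can name its metric as a string.
METRICS = {
    "exact_match": exact_match,
    "multiple_choice": multiple_choice_accuracy,
}

em = METRICS["exact_match"](" 4 ", "4")
mc = METRICS["multiple_choice"]({"a": -1.2, "b": -0.3}, {"a": 0, "b": 1})
```

Looking metrics up by name lets each task file declare how it should be scored without the evaluation loop knowing the details.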
Each task and each example in the benchmark is independent, so they can be evaluated in parallel on any number of devices. The bottleneck is model API rate limits, not the benchmark itself.
BIG-Bench is an evaluation benchmark, so it requires no specific hardware. It runs wherever the model runs (GPU, TPU, CPU, remote API).
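Because examples are independent, a thread pool is one straightforward way to evaluate them concurrently against a remote API. In this sketch `call_model` is a local stand-in for an API call, and `max_workers` acts as a crude cap on in-flight requests; real rate limiting would depend on the provider.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical examples: "n + 1 =" with the expected answer.
EXAMPLES = [{"input": f"{n} + 1 =", "target": str(n + 1)} for n in range(20)]


def call_model(prompt: str) -> str:
    """Stand-in for a remote model API call."""
    number = int(prompt.split("+")[0])
    return str(number + 1)


def score_example(example: dict) -> bool:
    """Score one example by exact match; independent of all other examples."""
    return call_model(example["input"]).strip() == example["target"]


def evaluate_parallel(examples: list, max_workers: int = 4) -> float:
    """Score examples concurrently; max_workers caps simultaneous requests."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(score_example, examples))
    return sum(results) / len(results)


accuracy = evaluate_parallel(EXAMPLES)
```

Threads suit this workload because time is spent waiting on network responses rather than computing; lowering `max_workers` is the simplest lever when a provider's rate limit is hit.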