Robots Atlas

Abstraction and Reasoning Corpus for AGI

The only benchmark measuring "fluid intelligence" in AI – the ability to abstract and reason about entirely novel tasks using only core knowledge priors (shared by humans), without the ability to "buy" scores through massive training data.

Category
Abstraction level
Operation level
measuring general AI intelligence Β· fluid intelligence evaluation Β· AGI research Β· testing abstraction and reasoning capabilities

Each task consists of 2–5 demonstration pairs (colored pixel grids: input β†’ output) and one or more test cases. The system must discover the rule governing the transformation and apply it to the test inputs. Answers are output grids (up to 30Γ—30 cells, 10 colors). Scoring is binary success/fail per task; the overall score is the percentage of tasks solved.
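The task format and scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation harness: the task dictionary mirrors the public ARC JSON layout (`train`/`test` lists of input/output grids), and the identity `solve` function is a placeholder for a real reasoning system.

```python
from typing import List

Grid = List[List[int]]  # cell values 0-9 encode the 10 colors

def solve(grid: Grid) -> Grid:
    """Placeholder solver: a real system must infer the transformation
    from the demonstration pairs. Here we simply echo the input."""
    return [row[:] for row in grid]

def score_task(task: dict) -> bool:
    """Binary per-task scoring: every test output must match exactly."""
    return all(solve(case["input"]) == case["output"] for case in task["test"])

def score_benchmark(tasks: List[dict]) -> float:
    """Overall score: percentage of tasks solved."""
    solved = sum(score_task(t) for t in tasks)
    return 100.0 * solved / len(tasks)

# Example: a trivial task whose rule is the identity, so the echo solver succeeds.
task = {
    "train": [{"input": [[1, 0], [0, 1]], "output": [[1, 0], [0, 1]]}],
    "test": [{"input": [[2, 2], [0, 0]], "output": [[2, 2], [0, 0]]}],
}
```

Note the all-or-nothing comparison in `score_task`: a grid that is off by a single cell scores zero, which is what makes ARC scores hard to inflate with approximate pattern matching.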

There was no benchmark resistant to "buying" scores through massive training data: existing benchmarks measured stored knowledge (crystallized intelligence) rather than general reasoning ability (fluid intelligence), preventing assessment of progress toward AGI.

Common pitfalls

Gap between training set and private test set
HIGH

Good scores on the public test set do not guarantee good performance on the private test set (ARC Prize evaluation).

Evaluate exclusively on the private set through the official ARC Prize competition.

Overfitting to known tasks
CRITICAL

Systems trained on known ARC tasks may overfit to their specific patterns without demonstrating genuine reasoning.

Use new tasks (ARC-AGI-2/3) and evaluate on the private test set.

GENESIS Β· Source paper

On the Measure of Intelligence
2019 Β· arXiv Β· FranΓ§ois Chollet

ARC and "On the Measure of Intelligence" paper published

breakthrough

FranΓ§ois Chollet defines intelligence as skill-acquisition efficiency and introduces the ARC benchmark.

2024

ARC Prize 2024 – first systems exceed 55% on private test set

breakthrough

Public Kaggle competition with $1M prize pool attracts hundreds of teams; LLM+program synthesis hybrids exceed 55%.

2025

ARC-AGI-2 and ARC-AGI-3 – new, harder versions

ARC Prize Foundation releases new benchmark versions with harder tasks as models begin saturating ARC-AGI-1.

Hardware agnostic Β· PRIMARY

Pixel-grid benchmark; evaluation is hardware-agnostic although solver programs may leverage GPU.

On the Measure of Intelligence
scientific article Β· arXiv
ARC Prize – official website
official website Β· ARC Prize Foundation