Each task consists of 2-5 demonstration pairs (colored pixel grids: input → output) and one or more test cases. The system must discover the rule governing the transformation and apply it. Answers are digital grids (up to 30x30 pixels, 10 colors). Scoring: binary success/fail per task; score is % of solved tasks.
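The scoring rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the official evaluation harness: the names `Grid`, `is_valid_grid`, `task_solved`, and `score` are hypothetical, colors are assumed to be encoded as integers 0-9, and a task is assumed to count as solved only when every test output matches exactly on a single attempt.

```python
from typing import List

# Assumption: a grid is a list of rows, each cell a color index 0-9.
Grid = List[List[int]]

MAX_SIDE = 30    # grids are at most 30x30
NUM_COLORS = 10  # 10 possible colors


def is_valid_grid(grid: Grid) -> bool:
    """Check the size and color limits: rectangular, at most 30x30, colors 0-9."""
    if not grid or not grid[0]:
        return False
    height, width = len(grid), len(grid[0])
    if height > MAX_SIDE or width > MAX_SIDE:
        return False
    return all(
        len(row) == width and all(0 <= cell < NUM_COLORS for cell in row)
        for row in grid
    )


def task_solved(predictions: List[Grid], expected: List[Grid]) -> bool:
    """Binary outcome: every predicted grid must equal the expected grid cell for cell."""
    return len(predictions) == len(expected) and all(
        p == e for p, e in zip(predictions, expected)
    )


def score(results: List[bool]) -> float:
    """Percentage of solved tasks; 0.0 for an empty result list."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Toy usage: one task solved, one task failed.
results = [
    task_solved([[[1, 0], [0, 1]]], [[[1, 0], [0, 1]]]),  # exact match
    task_solved([[[1, 0], [0, 1]]], [[[1, 0], [0, 2]]]),  # one cell differs
]
print(score(results))  # 50.0
```

Because the comparison is exact, a prediction that is wrong in a single cell earns no partial credit, which is what makes the per-task outcome binary.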
Absence of a benchmark resistant to "buying scores" through massive training data; existing benchmarks measured stored knowledge (crystallized intelligence) instead of general reasoning ability (fluid intelligence), preventing assessment of progress toward AGI.
Good scores on the public test set do not guarantee good performance on the private test set (ARC Prize evaluation).
Systems trained on known ARC tasks may overfit to their specific patterns without demonstrating genuine reasoning.
Francois Chollet defines intelligence as skill-acquisition efficiency and introduces the ARC benchmark.
Public Kaggle competition with $1M prize pool attracts hundreds of teams; LLM+program synthesis hybrids exceed 55%.
ARC Prize Foundation releases new benchmark versions with harder tasks as models begin saturating ARC-AGI-1.
Pixel-grid benchmark; evaluation is hardware-agnostic, although solver programs may leverage GPUs.