ARC Prize, a non-profit organisation that evaluates how effectively AI models demonstrate human-like intelligence, has announced the ARC-AGI-2 benchmark.
The new benchmark is a successor to the ARC-AGI benchmark released a few years ago. Like its predecessor, it tests AI models on tasks that are relatively easy for humans but difficult for artificial systems.
The ARC-AGI-2 benchmark poses even greater challenges than its predecessor, as it factors in efficiency (cost per task) in addition to performance. Its tasks require AI models to interpret symbols beyond their visual patterns, apply multiple interrelated rules simultaneously, and apply rules differently depending on context.
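For readers unfamiliar with the format, ARC-style tasks are small, self-contained puzzles: a handful of demonstration input-output grid pairs plus one or more test inputs, distributed as JSON files. The sketch below is a minimal illustration of reading one such task in Python; the filename is hypothetical, and the structure shown follows the publicly documented ARC task schema, in which grids are 2D lists of integers 0-9, each integer denoting a colour.

```python
import json

# Illustrative only: the filename is hypothetical. Actual task files are
# published in the ARC Prize GitHub repositories in this JSON schema.
with open("arc_task.json") as f:
    task = json.load(f)

# Each task contains a few demonstration pairs ("train") and one or more
# test cases ("test"). Grids are 2D lists of integers 0-9 (colours).
for pair in task["train"]:
    print("demo input :", pair["input"])
    print("demo output:", pair["output"])

# The solver must infer the transformation rule from the demonstrations
# and produce the correct output grid for each test input.
for pair in task["test"]:
    print("test input :", pair["input"])
```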
The results revealed that AI models struggled with all of the above tasks. Non-reasoning models, or ‘Pure LLMs’, scored 0% on the benchmark, while publicly available reasoning models managed only single-digit scores below 4%. In contrast, a human panel solving the tasks achieved a perfect score of 100%.
“AI systems are already superhuman in many specific domains (e.g., playing Go and image recognition). However, these are narrow, specialised capabilities. The ‘human-AI gap’ reveals what’s missing for general intelligence: highly efficiently acquiring new skills,” the organisation said.
OpenAI’s unreleased o3 reasoning model achieved the highest score of 4.0%. In the previous ARC-AGI-1 benchmark, it scored 75.7%. However, Sam Altman, CEO of OpenAI, has disclosed that it will not be released as a standalone model. Instead, o3’s reasoning capabilities will be integrated into a hybrid GPT-5 model.
Beyond that, no other AI model posted a noteworthy score. Even the recently released Claude 3.7 Sonnet, often considered the best model for coding, scored 0.7%, while DeepSeek-R1 scored 1.3%. The leaderboard also listed the cost (in USD) of running each task.

Source: ARC Prize
“All other AI benchmarks focus on superhuman capabilities or specialised knowledge by testing ‘PhD++’ skills. ARC-AGI is the only benchmark that takes the opposite design choice by focusing on tasks that are relatively easy for humans, yet hard, or impossible, for AI,” the organisation added.
François Chollet, creator of Keras and a former Google researcher, co-created the ARC-AGI benchmark. He has described it as “the only AI benchmark that measures progress towards general intelligence”.
Recently, Chollet, along with Zapier co-founder Mike Knoop, launched Ndea, a new research lab dedicated to creating artificial general intelligence (AGI).