- Published on April 3, 2025
- In AI News
An AI judge grades each agent’s submission and assigns it a score.

Illustration by Nikhil Kumar
OpenAI has unveiled PaperBench, a new benchmark that measures how well AI agents can reproduce cutting-edge AI research. The test checks whether an AI can understand a research paper, write the code, and execute it to match the paper’s results.
PaperBench uses 20 top papers from the International Conference on Machine Learning (ICML) 2024, covering 12 different topics. Across these papers, the benchmark contains 8,316 individually gradable tasks. To grade them objectively, rubrics were developed that decompose each replication task hierarchically into smaller subtasks with clear grading criteria. These rubrics were co-developed with the authors of each ICML paper for accuracy and realism.
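To illustrate how a hierarchical rubric of this kind can turn many small pass/fail checks into a single score, here is a minimal Python sketch. It is not PaperBench’s actual code: the `RubricNode` class, the `score` function, the node names, and the weights are all hypothetical, and the weighted-average roll-up is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node in a hypothetical rubric tree (all names are illustrative)."""
    name: str
    weight: float = 1.0             # relative importance among siblings (assumed)
    passed: Optional[bool] = None   # set only on leaves, by the judge
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Return a score in [0, 1].

    Leaves are graded pass/fail; a parent's score is the
    weighted average of its children's scores.
    """
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    return sum(child.weight * score(child) for child in node.children) / total_weight


# A toy rubric with made-up requirements and judge verdicts.
paper = RubricNode("reproduce paper", children=[
    RubricNode("code development", weight=2.0, children=[
        RubricNode("model implemented", passed=True),
        RubricNode("training loop implemented", passed=False),
    ]),
    RubricNode("results match", weight=1.0, passed=False),
])

print(f"{score(paper):.3f}")  # prints 0.333
```

In this toy tree, one of two code subtasks passes, so "code development" scores 0.5; with the results check failing, the root score is (2 × 0.5 + 1 × 0) / 3, or about 33%.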
The AI has to extract the details from the paper and submit all the code needed to reproduce it in a repository. The benchmark also requires the AI to create a ‘reproduce.sh’ script that runs the code and, if the attempt succeeds, reproduces the paper’s results.
Submissions are evaluated by an AI judge, which OpenAI claims performs close to a human judge. “Our best LLM-based judge, which uses o3-mini-high with custom scaffolding, achieves an F1 score of 0.83 on the auxiliary evaluation, suggesting that this judge is a reasonable stand-in for a human judge,” the research paper stated.
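For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The Python sketch below shows one way it could be computed when comparing a judge’s pass/fail verdicts with human verdicts. The `f1_score` helper and the sample labels are made up for illustration, and the article does not say how the paper aggregates its F1 figure, so treating "pass" as the positive class is an assumption.

```python
from typing import Sequence


def f1_score(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """F1 of the judge's verdicts against human verdicts ("pass" = positive class)."""
    tp = sum(h and j for h, j in zip(human, judge))          # both say pass
    fp = sum((not h) and j for h, j in zip(human, judge))    # judge passes, human fails
    fn = sum(h and (not j) for h, j in zip(human, judge))    # judge fails, human passes
    if tp == 0:
        return 0.0  # avoids division by zero when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up verdicts on six rubric items.
human_labels = [True, True, True, False, False, True]
judge_labels = [True, True, False, False, True, True]

print(f1_score(human_labels, judge_labels))  # prints 0.75
```

Here the judge agrees with the human on three passes, wrongly passes one item, and wrongly fails another, giving precision and recall of 0.75 each and an F1 of 0.75.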
Several AI models were tested on PaperBench. The best-performing model was Anthropic’s Claude 3.5 Sonnet, which achieved a 21.0% replication score. Other models, including OpenAI’s o1 and GPT-4o, Google’s Gemini 2.0 Flash, and DeepSeek-R1, scored lower.

In comparison, machine learning PhDs recruited as a human baseline scored 41.4%, suggesting that current AI agents remain well short of human expertise.
In a separate test, OpenAI’s o1 was given an extended run time, but it still failed to match the human attempt.

PaperBench’s code is available to the public on GitHub. A lightweight version of the benchmark, PaperBench Code-Dev, is also available for more people to use.
Ankush Das
I am a tech aficionado and a computer science graduate with a keen interest in AI, Open Source, and Cybersecurity.