OpenAI’s New Benchmark to Study AI Agents’ Research Capabilities

Published on April 3, 2025
In AI News

An AI judge grades each agent’s submission to produce its score.

Illustration by Nikhil Kumar

OpenAI has unveiled PaperBench, a new benchmark that measures how well AI agents can reproduce cutting-edge AI research. It tests whether an agent can understand a research paper, write the code, and execute it to match the paper’s results.

PaperBench uses 20 top papers from the International Conference on Machine Learning (ICML) 2024, covering 12 different topics. Across these papers, the benchmark contains 8,316 individually gradable tasks. For objective evaluation, rubrics were developed that hierarchically decompose each replication task into smaller subtasks with clear grading criteria. These rubrics were co-developed with the authors of each ICML paper for accuracy and realism.
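To make the idea of a hierarchical rubric concrete, here is a minimal Python sketch of how pass/fail grades on leaf subtasks could roll up into a single replication score. The node names, the weights, and the weighted-average aggregation are assumptions made for illustration; the article does not describe how PaperBench combines subtask grades.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One requirement in a hypothetical rubric tree."""
    name: str
    weight: float = 1.0              # assumed relative importance among siblings
    passed: Optional[bool] = None    # only meaningful on leaf nodes
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Return a score in [0, 1] for this node.

    Leaves score 1.0 if passed, else 0.0. Internal nodes take the
    weighted average of their children (an assumed aggregation rule).
    """
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    return sum(child.weight * score(child) for child in node.children) / total_weight


# A made-up rubric for a single paper, not taken from the benchmark.
rubric = RubricNode("replicate paper", children=[
    RubricNode("code development", weight=2.0, children=[
        RubricNode("model implemented", passed=True),
        RubricNode("training loop implemented", passed=False),
    ]),
    RubricNode("results match the paper", weight=1.0, passed=False),
])

replication_score = score(rubric)
print(f"replication score: {replication_score:.1%}")
```

In this toy tree, half of the code-development subtasks pass and the results check fails, so the root ends up with one third of the available credit.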

The agent must extract the necessary details from the paper and submit, in a repository, all the code required to reproduce it. The benchmark also requires the agent to create a ‘reproduce.sh’ script that executes the code and, if the replication is successful, reproduces the paper’s results.

Submissions are graded by an AI judge, which OpenAI claims performs close to a human judge. “Our best LLM-based judge, which uses o3-mini-high with custom scaffolding, achieves an F1 score of 0.83 on the auxiliary evaluation, suggesting that this judge is a reasonable stand-in for a human judge,” the research paper stated.
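For readers unfamiliar with the metric, the sketch below shows how an F1 score can be computed when a judge's pass/fail decisions are compared against human decisions. It assumes the plain binary F1 with the human labels treated as ground truth; the article does not say which F1 variant the paper uses, and the labels here are invented for illustration.

```python
from typing import Sequence


def f1_score(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Binary F1 of the judge's decisions, using human labels as ground truth."""
    true_pos = sum(1 for h, j in zip(human, judge) if h and j)
    false_pos = sum(1 for h, j in zip(human, judge) if not h and j)
    false_neg = sum(1 for h, j in zip(human, judge) if h and not j)
    if true_pos == 0:
        return 0.0  # avoids division by zero when nothing is correctly passed
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Invented labels: True means "this rubric item was satisfied".
human_labels = [True, True, True, False, False, True]
judge_labels = [True, True, False, False, True, True]

judge_f1 = f1_score(human_labels, judge_labels)
print(f"judge F1: {judge_f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so a judge scores highly only if it rarely passes items a human would fail and rarely fails items a human would pass.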

Several AI models were tested on PaperBench. The best-performing model was Anthropic’s Claude 3.5 Sonnet, which achieved a replication score of 21.0%. Other models, including OpenAI’s o1, GPT-4o, Gemini 2.0 Flash, and DeepSeek-R1, scored lower.

In comparison, human PhDs in machine learning scored 41.4% on average, suggesting that current AI agents still fall well short of human expertise.

A separate test was also conducted in which OpenAI’s o1 ran for an extended duration, but it still failed to match the human attempt.

PaperBench’s code is available to the public on GitHub. A lightweight version of the benchmark, PaperBench Code-Dev, is also available for more people to use.

Ankush Das

I am a tech aficionado and a computer science graduate with a keen interest in AI, Open Source, and Cybersecurity.
