- Published on January 19, 2025
- In AI News
The company had prior access to much of the dataset behind a benchmark on which its o3 model posted record results.

OpenAI’s o3 benchmark controversy is starting to look like a Theranos moment: the company claimed record-breaking performance on Epoch AI’s FrontierMath benchmark while funding the benchmark and having access to much of its test data.
Epoch AI’s associate director, Tamay Besiroglu, admitted that Epoch AI was contractually restricted from disclosing OpenAI’s involvement, while six contributing mathematicians said they were unaware of the exclusive access.
Besiroglu said, “We made a mistake in not being more transparent about OpenAI’s involvement.” He revealed that Epoch AI was restricted from disclosing the partnership until the o3 model was launched.
“Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset. We own this error and are committed to doing better in the future,” he added.
Besiroglu also acknowledged that OpenAI had access to a large portion of the FrontierMath problems and solutions. However, an ‘unseen-by-OpenAI hold-out set’ helped verify the model’s capabilities.
“Six mathematicians who significantly contributed to the FrontierMath benchmark confirmed this is true – that they are unaware that OpenAI will have exclusive access to this benchmark (and others won’t). Most express they are not sure they would have contributed had they known,” revealed Carina Hong, a PhD candidate at Stanford, on X.
AI experts like Gary Marcus are questioning the legitimacy of OpenAI’s claims, comparing the situation directly to Theranos.
In December 2024, when OpenAI announced its new o3 family of models, the company claimed that o3 achieved 25% accuracy on Epoch AI’s FrontierMath benchmark, a huge leap over the previous high scores of just 2% from other powerful models. The benchmark tasks LLMs with solving mathematical problems of unprecedented difficulty.
In an exclusive interaction with AIM earlier, Besiroglu revealed that Epoch AI significantly reduces data contamination issues by producing novel problems in the benchmark. He also said, “The [benchmark] data is private, so it’s not used for training.”
A user on LessWrong discovered that the latest version of FrontierMath’s research paper explaining the benchmark included a footnote stating, “We gratefully acknowledge OpenAI for their support in creating the benchmark.”
Mikhail Samin, executive director at the AI Governance and Safety Institute, said on X that “OpenAI has a history of misleading behaviour- from deceiving its own board to secret non-disparagement agreements that former employees had to sign- so I guess this shouldn’t be too surprising.”
OpenAI also claimed the o3 model scored almost 90% on the ARC-AGI benchmark, exceeding human performance. The benchmark is said to be the “only AI benchmark that measures progress towards general intelligence.” However, François Chollet, creator of the ARC-AGI benchmark, stated, “I don’t believe this is AGI—there are still easy ARC-AGI-1 tasks that o3 can’t solve.”
Marcus has been skeptical of the results since the model’s launch. Earlier, he said, “Not one person outside of OpenAI has evaluated o3’s robustness across different types of problems.”
Amid the benchmark controversy, OpenAI CEO Sam Altman appears eager to release o3-mini in the coming weeks.
Supreeth Koundinya
Supreeth is an engineering graduate who is curious about the world of artificial intelligence and loves to write stories on how it is solving problems and shaping the future of humanity.