2025 is the year of tougher AI benchmarks.

Imagine a room full of mathematicians racking their brains over a problem. The stakes are high, and the pressure is intense. Now picture an AI stepping in, solving the problem correctly and leaving the human experts stunned.
That’s precisely what happened last month. OpenAI’s o3 series models redefined how we measure intelligence, offering a glimpse of what lies ahead.
FrontierMath vs Traditional Benchmarks
OpenAI’s o3 models posted record-breaking scores on benchmarks like ARC-AGI, SWE-bench Verified, Codeforces, and Epoch AI’s FrontierMath. The most striking result, however, was o3’s performance on FrontierMath, widely regarded as the toughest mathematical test available.
In an exclusive interaction with AIM, Epoch AI’s co-founder Tamay Besiroglu spoke about what sets their benchmark apart. “Standard math benchmarks often draw from educational content; ours is problems mathematicians find interesting (e.g. highly creative competition problems or interesting research),” he said.
He added that Epoch significantly reduces data-contamination issues by producing novel problems, and that with existing benchmarks like MATH close to saturation, their dataset will likely remain useful for some time.
FrontierMath problems can take even expert mathematicians hours or days to solve. Fields Medal winner Terence Tao described them as exceptionally challenging, requiring a mix of human expertise, AI, and computer algebra tools. British mathematician Timothy Gowers called them far more complex than International Mathematical Olympiad (IMO) problems and beyond his own expertise.
Bullish on this particular benchmark, OpenAI’s Noam Brown said, “Even if LLMs are dumb in some ways, saturating evals like Epoch AI’s FrontierMath would suggest AI is surpassing top human intelligence in certain domains.”
The Problem of Gaming the System
AI is excellent at playing by the rules, sometimes a little too well.
This means that as benchmarks become predictable, machines get good at “gaming” them: recognising patterns, finding shortcuts, and scoring high without really understanding the task.
“The data is private, so it’s not used for training,” said Besiroglu of how Epoch tackles this problem. Keeping the problems out of training corpora makes it harder for AI to cheat the system. But as tests evolve, so do the strategies machines use to game them.
As AI surpasses human abilities in fields such as mathematics, comparisons between the two may become less and less meaningful.
After o3’s performance on FrontierMath, Epoch AI announced plans to host a competition in Cambridge in February or March 2025 to establish a human expert baseline. Leading mathematicians are being invited to take part.
“This tweet is exactly what you would expect to see in a world where AI capabilities are growing… feels like the background news story in the first scene of a sci-fi drama,” said Wharton professor Ethan Mollick.
Interestingly, competitions that once celebrated human skills are increasingly influenced by AI’s capabilities, raising the question of whether humans and machines should compete separately. “Large benchmarks like FrontierMath might be more practical than competitions, given the constraints humans face compared to AI, which can tackle hundreds of problems repeatedly,” Besiroglu suggested.
People are comparing this moment to the eras of AlphaGo and IBM’s Deep Blue. “This will be our generation’s historic Deep Blue vs Kasparov chess match, where human intellect was first bested by AI. Could redefine what we consider as the pinnacle of problem-solving,” read a post on X.
Meanwhile, the ARC-AGI benchmark has announced an upgrade, ARC-AGI-2, and FrontierMath has unveiled a new Tier 4. The pace of AI progress is unparalleled.
2025 is the Year of Tougher Benchmarks
“We are now confident we know how to build AGI as we have traditionally understood it. We believe that in 2025, we may see the first AI agents ‘join the workforce’ and materially change the output of companies,” said OpenAI chief Sam Altman in a recent blog post.
Benchmarks like FrontierMath aren’t just measuring today’s AI; they’re shaping what comes next. With 2025 predicted to be the year of agentic AI, it could also mark significant strides toward AGI and perhaps the first glimpses of ASI (artificial superintelligence).
But are we ready for such systems? The stakes are high, and the benchmarks we create today will have long-term consequences and real-world impact.
“I think good benchmarks help provide clarity about how good AI systems are, but don’t have much of a direct effect on advancing the development itself,” added Besiroglu, describing the impact of these benchmarks on real-world progress.
In a podcast last year, Anthropic CPO Mike Krieger said that models are limited by evaluations and not intelligence.
To this, Besiroglu responded: “I think models are going to get a lot better over the next few years. Having strong benchmarks will provide a better understanding of this trend.”
FrontierMath is part of a larger effort to rethink how we measure intelligence. As machines get smarter, benchmarks must grow smarter, too—not just in complexity but in how they align with real-world needs.
Aditi Suresh
I hold a degree in political science and am interested in how AI and online culture intersect. I can be reached at [email protected]