OpenAI’s Model Evaluation Partner METR Flags Potential Cheating in o3

2 days ago 6

Published on April 17, 2025
In AI News

An early evaluation revealed attempts at “reward hacking” in o3 and strong task-solving abilities in o4-mini.

The Machine Intelligence Testing for Risks (METR), an organisation that works with OpenAI to test their models, alleged that the AI company’s o3 model appears to have a greater tendency to cheat or hack tasks to boost its score.

In its blog post, the benchmarking company said the o3 evaluation was conducted in a short timeframe with limited access to information.

METR gets early access to test OpenAI models. This preliminary analysis was done three weeks before the models were made public. It used METR’s HCAST (Human-Calibrated Autonomy Software Tasks) and RE-Bench test suites to measure the models’ performance.

An early evaluation revealed attempts at “reward hacking” in o3 and strong task-solving abilities in o4-mini.

Both o3 and o4-mini performed better than Claude 3.7 Sonnet on an updated HCAST benchmark. METR said their “50% time horizons” were about “1.8x and 1.5x that of Claude 3.7 Sonnet, respectively.”

“We did not access the model’s internal reasoning, which is likely to contain important information for interpreting our results,” the company said in the blog post. “In general, we believe that pre-deployment capability testing is not a sufficient risk management strategy by itself, and we are currently prototyping additional forms of evaluations.”

METR found that o3 attempted to exploit the scoring system in some cases. The report said, “Between 1% and 2% of all task attempts by o3 across HCAST and RE-Bench contained some attempt at reward hacking.” These included “relatively sophisticated exploits against the scoring code for some tasks.”

The report noted that cheating attempts were counted as failed tasks. Without this correction, o3’s 50% time horizon would be about 5 minutes longer, and its RE-Bench score would be higher than human experts.

Recently, the Financial Times reported that OpenAI has reduced the time and resources dedicated to testing the safety of its advanced AI models, prompting concerns that the technology is being released without adequate safeguards. “We had more thorough safety testing when [the technology] was less important,” said one person who tested o3 model, the report said.

METR also raised a concern that o3 might be “sandbagging,” or deliberately underperforming. The report stated that the model “appears to have a higher propensity to cheat or hack tasks in sophisticated ways in order to maximise its score, even when the model clearly understands this behaviour is misaligned with the user’s and OpenAI’s intentions.”

In comparison, o4-mini did not show reward hacking attempts and scored well on a group of RE-Bench tasks. It performed best in “Optimise a Kernel,” which lifted its overall score.

METR said that when given 32 hours to complete this subset of tasks, “o4-mini exceeds the 50th percentile of human performance averaged across the five tasks.”

Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.

Our Upcoming Conference

India's Biggest Conference on AI Startups

April 25, 2025 | 📍 Hotel Radisson Blu, Bengaluru

Download the easiest way to
stay informed