- Published on April 8, 2025
- In AI News
Meta attributed the mixed performance reports to implementation issues rather than flaws in the training process.
Illustration by Nikhil Kumar
Meta has denied allegations that its Llama 4 models were trained on benchmark test sets. In a post on X, Ahmad Al-Dahle, Meta’s VP of GenAI, said, “We’ve also heard claims that we trained on test sets — that’s simply not true, and we would never do that.” He added that the company released the models as soon as they were ready and that “it’ll take several days for all the public implementations to get dialed in.” The company attributed the mixed performance reports to implementation issues rather than flaws in the training process.
Meta recently launched two new Llama 4 models, Scout and Maverick.
Maverick quickly reached the second spot on LMArena, the AI benchmark platform where users vote on the best responses in head-to-head model comparisons. In its press release, Meta pointed to Maverick’s Elo score of 1417, ranking it above OpenAI’s GPT-4o and just below Gemini 2.5 Pro.
However, the version of Maverick evaluated on LMArena isn’t identical to what Meta has made publicly available. In its blog post, Meta said that it used an “experimental chat version” tailored to improve “conversationality.”
Chatbot Arena, run by lmarena.ai (formerly lmsys.org), acknowledged the community’s concerns. “To ensure full transparency, we’re releasing 2,000+ head-to-head battle results for public review. This includes user prompts, model responses, and user preferences,” the company said.
It also said that Meta’s interpretation of the Arena’s policies did not match its expectations, and that it is updating its leaderboard policies to ensure fair and reproducible evaluations in the future.
“In addition, we’re also adding the HF version of Llama-4-Maverick to Arena, with leaderboard results published shortly. Meta’s interpretation of our policy did not match what we expect from model providers. Meta should have made it clearer that ‘Llama-4-Maverick-03-26-Experimental’ was a customised model to optimise for human preference,” the company said.
The drama around Llama 4’s benchmarks started with a now-viral Reddit post citing a Chinese-language report, allegedly written by a Meta employee involved in Llama 4’s development, which claimed there was internal pressure to blend benchmark test sets into the data during post-training.
“Company leadership suggested blending test sets from various benchmarks during the post-training process, aiming to meet the targets across various metrics,” the post read. The employee also wrote that they had submitted their resignation and asked to be excluded from Llama 4’s technical report.
AIM reached out to sources at Meta, who confirmed that the employee has not left the company and that the Chinese post is fake.
However, several AI researchers have noted a gap between the benchmark results Meta reported and the ones they observed. “Llama 4 on LMSys is a totally different style than Llama 4 elsewhere, even if you use the recommended system prompt. Tried various prompts myself,” said a user on X.
“4D chess move: use Llama 4 experimental to hack LMSys, expose the slop preference, and finally discredit the entire ranking system,” quipped Susan Zhang, senior staff research engineer at Google DeepMind.
Questions were also raised about Llama 4’s weekend release, as tech giants usually make major announcements on weekdays. Meta was also reportedly under pressure to ship Llama 4 before DeepSeek launches its next reasoning model, R2. Meanwhile, Meta has announced that it will release its own reasoning model soon.
Before the release of Llama 4, The Information had reported that Meta pushed back the release date at least twice because the model didn’t perform as well on technical benchmarks as hoped, particularly in reasoning and math tasks. Meta also had concerns that Llama 4 was less capable than OpenAI’s models at conducting humanlike voice conversations.
Siddharth Jindal
Siddharth is a media graduate who loves to explore tech through journalism, putting forward ideas worth pondering in the era of artificial intelligence.