- Published on April 8, 2025
- In AI News
Meta attributed the mixed performance reports to implementation issues rather than flaws in the training process.
Illustration by Nikhil Kumar
Meta has denied allegations that its Llama 4 models were trained on benchmark test sets. In a post on X, Ahmad Al-Dahle, Meta’s VP of GenAI, said, “We’ve also heard claims that we trained on test sets — that’s simply not true, and we would never do that.” He added that the company released the models as soon as they were ready and that “it’ll take several days for all the public implementations to get dialed in.” The company attributed the mixed performance reports to implementation issues rather than flaws in the training process.
Meta recently launched two new Llama 4 models, Scout and Maverick.
Maverick quickly reached the second spot on LMArena, the AI benchmark platform where users vote on the best responses in head-to-head model comparisons. In its press release, Meta pointed to Maverick’s Elo score of 1417, ranking it above OpenAI’s GPT-4o and just below Gemini 2.5 Pro.
However, the version of Maverick evaluated on LMArena isn’t identical to what Meta has made publicly available. In its blog post, Meta said that it used an “experimental chat version” tailored to improve “conversationality.”
Chatbot Arena, run by lmarena.ai (formerly lmsys.org), acknowledged the community’s concerns. “To ensure full transparency, we’re releasing 2,000+ head-to-head battle results for public review. This includes user prompts, model responses, and user preferences,” the company said.
It also said that Meta’s interpretation of the Arena’s policies did not match its expectations, and that it is updating its leaderboard policies to ensure fair and reproducible evaluations in the future.
“In addition, we’re also adding the HF version of Llama-4-Maverick to Arena, with leaderboard results published shortly. Meta’s interpretation of our policy did not match what we expect from model providers. Meta should have made it clearer that ‘Llama-4-Maverick-03-26-Experimental’ was a customised model to optimise for human preference,” the company said.
The drama around Llama 4’s benchmarks started with a now-viral Reddit post citing a Chinese-language report, allegedly written by a Meta employee involved in Llama 4’s development, which claimed there was internal pressure to blend benchmark test sets into the data during post-training.
“Company leadership suggested blending test sets from various benchmarks during the post-training process, aiming to meet the targets across various metrics,” the post read. The employee also wrote that they had submitted their resignation and asked to be excluded from Llama 4’s technical report.
AIM reached out to sources at Meta, who confirmed that the employee has not left the company and that the Chinese post is fake.
However, several AI researchers have noted a gap between the benchmark results Meta reported and the ones they observed. “Llama 4 on LMSys is a totally different style than Llama 4 elsewhere, even if you use the recommended system prompt. Tried various prompts myself,” said a user on X.
“4D chess move: use Llama 4 experimental to hack LMSys, expose the slop preference, and finally discredit the entire ranking system,” quipped Susan Zhang, senior staff research engineer at Google DeepMind.
Questions were also raised about Llama 4’s weekend release, as tech giants usually make major announcements on weekdays. Meta was also reportedly under pressure to ship Llama 4 before DeepSeek launches its next reasoning model, R2. Meanwhile, Meta has announced that it will release its own reasoning model soon.
Before the release of Llama 4, The Information had reported that Meta pushed back the release date at least twice because the model didn’t perform as well on technical benchmarks as hoped, particularly in reasoning and math tasks. Meta also had concerns that Llama 4 was less capable than OpenAI’s models at conducting humanlike voice conversations.
Siddharth Jindal
Siddharth is a media graduate who loves to explore tech through journalism, putting forward ideas worth pondering in the era of artificial intelligence.