Gemini 2.5 outperforms Claude 3.7 Sonnet at coding on the Aider Polyglot leaderboard.

It’s a sign of trouble when just one company, with a single new feature, manages to monopolise the internet’s collective attention. For days, every social media feed was flooded exclusively with ‘Ghibli-fied’ visuals, all thanks to ChatGPT’s newly released image generation feature.
Google has gone all in. Rather than merely chasing OpenAI’s spotlight, Google has announced the Gemini 2.5 family of models, and the first of them, 2.5 Pro Experimental, now leads several benchmarks as the top frontier AI model.
The model ranks first on the GPQA benchmark, which tests AI models on graduate-level science questions. It scored 83%, outperforming OpenAI’s o1-Pro (79%) and Claude 3.7 Sonnet with extended thinking (77%). It ranked highest on many other benchmarks as well.

Source: Artificial Analysis
Moreover, Gemini 2.5 is already receiving acclaim as potentially the best AI model for coding, a title that no model other than Anthropic’s Claude has managed to claim convincingly. Could Claude 3.7 Sonnet finally be facing some genuine competition?
On the Aider Polyglot leaderboard, which evaluates LLMs’ ability to write and edit code, Gemini 2.5 Pro Experimental scored 72.9%, beating Claude 3.7 Sonnet (64.9%), OpenAI’s o1 (61.7%), and o3-mini (high) at 60.4%.
‘Google Delivered a Real Winner Here’
“Gemini 2.5 Pro is now easily the best model for code,” Mckay Wrigley, a developer, said. He also highlighted that the model doesn’t just agree with the user all the time, and that it showed “flashes of genuine brilliance”.
“Google delivered a real winner here,” Wrigley added.
In real-world scenarios, too, many developers’ experiences aligned with the benchmark scores, particularly in comparisons with Anthropic’s Claude 3.7 Sonnet.
A user on Reddit shared their experience of spending approximately three to four hours building an app with Claude 3.7 Sonnet, resulting in non-functional code with poor security practices, including hardcoded API keys.
After they switched to Gemini 2.5 and provided the entire faulty codebase as input, it identified and explained the flaws, while also rewriting the entire application effectively.
In another instance, Gemini 2.5 outperformed Claude 3.7 Sonnet in accurately reproducing a user interface. A user on X tested both models’ abilities in recreating ChatGPT’s user interface. Gemini 2.5 provided a more accurate representation.
All things considered, Gemini 2.5 is also a huge leap for Google over its preceding models. Alex Mizrahi, a developer, shared how the model recalled about 80-90% of Rell syntax purely from memory, a significant improvement over earlier Gemini versions, which struggled even when provided with examples.
Moreover, users expressed a preference for Gemini 2.5 over other models when it comes to vibe coding. Developer Matthew Berman said on X, “It (Gemini 2.5 Pro) asks me clarifying questions along the way, which no other model has done.” This, he suggested, makes it feel “much more” collaborative.
Gemini 2.5 also holds an advantage over other coding models with its 1-million-token input context window. OpenAI’s o1 and o3-mini support only 200k tokens, while Anthropic is reportedly planning to extend Claude’s window to 500k tokens.
While it is an improvement over other models, it is still imperfect, and it exhibits the classic problems associated with AI models in coding.
Kaden Bilyeu, a developer, said on X that Gemini 2.5 tried to generate chat responses by calling the API directly from the client side, an approach that would leak the API key to anyone inspecting the page.
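To see why this is risky, it helps to compare calling a chat API from the browser with routing the call through a backend. The sketch below is illustrative only: the endpoint URL, the /api/chat route, the request shape, and the CHAT_API_KEY variable are assumptions made for the example, not Gemini’s actual API.

```typescript
// Illustrative sketch: endpoint URL, route, and env var names are
// assumptions for this example, not any vendor's real API.
import express from "express";

const app = express();
app.use(express.json());

// Anti-pattern (what generated code like this effectively does): calling
// the chat API straight from browser code bundles the key into the page,
// where any visitor can read it from DevTools or the network tab.
//
//   const API_KEY = "sk-...";  // hardcoded, shipped to every client
//   fetch("https://api.example.com/v1/chat", {
//     headers: { Authorization: `Bearer ${API_KEY}` },
//   });

// Safer pattern: the browser talks only to your backend; the key lives in
// the server's environment and never reaches the client.
app.post("/api/chat", async (req, res) => {
  const upstream = await fetch("https://api.example.com/v1/chat", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.CHAT_API_KEY}`, // read from env, never hardcoded
    },
    body: JSON.stringify({ messages: req.body.messages }),
  });
  res.status(upstream.status).json(await upstream.json());
});

app.listen(3000);
```

The client then posts to /api/chat instead of the vendor’s endpoint, which also gives the backend a single place to add rate limiting and logging.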
There are also mixed reviews of how the model handles large codebases. Louie Bacaj, a developer, reported that Gemini 2.5 struggled significantly when working with a codebase of about 3,500 lines.
He noted that despite claims of enhanced context handling, the model had trouble performing requested tasks even when API calls succeeded.
So, human judgment and intervention remain essential when using any AI model for coding. Besides, the first Gemini 2.5 release, 2.5 Pro Experimental, is, as the name suggests, still experimental, so further refinements and improvements can reasonably be expected.
However, one area where Google needs to up its game is packaging its AI models better. This is precisely why OpenAI’s GPT-4o gained more traction for image generation, even though Google had shipped a similar feature in the Gemini 2.0 Flash model shortly before.
Google Needs to Go Big on Consumer Experience
“I feel a little bit for the Google DeepMind team,” said Nikunj Kothari, an angel investor. “You build a world-changing model and everyone is posting Ghibli-fied pictures instead.”
He also said that this has been the core problem with Google, where they can build the best AI models in the world, but fail to focus on consumer experience. “I beg of them to take 20% of their best talented folks and give them free rein on building world-class consumer experiences,” Kothari added.
He also added that the model’s personality feels quite basic compared to the others, a sentiment several other users echoed.
When native image generation in Gemini 2.0 Flash was released, it earned praise for its capabilities. However, it wasn’t easy for many users to find and use the feature in the first place. The user interface was quite unintuitive, with options needlessly buried under menus. One user on X described the friction:
Why did Google bury its incredible image generation feature in a UX like this?
To use it, I have to:
1. Somehow know to use AI Studio instead of regular Gemini
2. Somehow know to pick Gemini 2.0 Flash (image gen) model on the right
3. Somehow know that I can edit images…
But circling back to the entire Ghibli mania, it might not be that Google failed in marketing its product effectively, but rather that OpenAI excelled at tapping into user psychology.
“You post two pictures and everyone gets it,” said a user on X, on OpenAI showcasing the image generation capabilities in GPT-4o.
“You ask the same people to read a report generated by 2.0 and compare [it] to 2.5, and that requires more time than scrolling and liking,” he added.
Scenarios like these highlight that regardless of how powerful your AI models are or how groundbreaking the underlying research might be, the average user tends to gravitate toward results that are enjoyable, relatable, and emotionally engaging.
Supreeth Koundinya
Supreeth is an engineering graduate who is curious about the world of artificial intelligence and loves to write stories on how it is solving problems and shaping the future of humanity.