Spoiler alert: RAG is NOT dead.

When Meta announced the long-awaited next generation of its open-source model, Llama 4, debates emerged on social media about whether this marks the end of retrieval-augmented generation (RAG), owing to the model's 10-million-token context window. The massive context window allows the model to ingest far more information in a single query, raising questions about whether RAG is still necessary.
Shorter-context models often rely on external retrieval to access data. Llama 4's larger context window, however, lets it hold far more information internally, reducing the need for external sources when reasoning over static data. But is that enough to spell the end of RAG?
Leave RAG Alone, Please
Several developers and industry experts rallied to defend RAG. Cost is the first obstacle: pushing 10 million tokens into a context window will not be cheap. It would exceed a dollar per query and take 'tens of seconds' to generate a response, according to Marco D'Alia, a software architect, on X.
"People are saying the 10 million context size of @meta Llama 4 means RAG is dead. I have two questions for you: 1) Do you want to spend $1+ for each message? 2) Do you want to wait a VERY long time on every message to process all those tokens?"
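A back-of-envelope sketch puts rough numbers on those two questions. The per-token price and prefill throughput below are illustrative assumptions, not published Llama 4 figures:

```python
# Rough cost and latency of stuffing 10M tokens into a single request.
# Both constants are assumptions for illustration; real figures vary by provider.

INPUT_PRICE_PER_M_TOKENS = 0.19   # assumed $ per 1M input tokens
PREFILL_TOKENS_PER_SEC = 500_000  # assumed prefill throughput on serving hardware

context_tokens = 10_000_000

cost_per_query = context_tokens / 1_000_000 * INPUT_PRICE_PER_M_TOKENS
prefill_seconds = context_tokens / PREFILL_TOKENS_PER_SEC

print(f"~${cost_per_query:.2f} per query")        # ~$1.90
print(f"~{prefill_seconds:.0f}s just to prefill")  # ~20s, i.e. 'tens of seconds'
```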
Additionally, many emphasised that longer context windows were never meant to replace RAG, whose core purpose is to filter a large corpus down to the relevant chunks and add only those to the input.
“RAG isn’t about solving for a finite context window, it’s about filtering for signal from a noisy dataset. No matter how big and powerful your context window gets, removing junk data from the input will always improve performance,” said Jamie Voynow, a machine learning engineer on X.
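Voynow's point is easy to see in code. Here is a minimal sketch of retrieval as filtering, assuming the sentence-transformers library; the embedding model and toy corpus are example choices. The corpus is embedded once, and only the top-k chunks most similar to the query ever reach the prompt:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Example bi-encoder; any sentence-embedding model works here.
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Llama 4 Scout advertises a 10M-token context window.",
    "The office cafeteria switched coffee suppliers in March.",
    "RAG retrieves only the chunks relevant to a query before generation.",
]

# Embed the corpus once; normalized vectors make dot product = cosine similarity.
doc_vecs = model.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus chunks most similar to the query."""
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec
    return [corpus[i] for i in np.argsort(-scores)[:k]]

# Junk chunks (the cafeteria line) are filtered out before the model sees them.
context = "\n".join(retrieve("How does retrieval-augmented generation work?"))
```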
Gokul JS, a founding engineer at Aerotime, summarised the debate with a simple analogy: "Imagine handing someone a dense page of text, taking it away, then asking questions. They'll remember bits, not everything," he said in a post on X. LLMs are no different, he added: the ability to handle more context does not guarantee an accurate response.
Furthermore, a 10-million-token context window is huge, but it may not cover every use case. Granted, RAG's territory has shrunk over time, given how easily most AI models now answer questions over a few PDFs, but several practical use cases go well beyond that.
“Most enterprises have terabytes of documents. No context window can encompass a pharmaceutical company’s 50K+ research papers and decades of regulatory submissions,” said Skylar Payne, a former ML systems engineer at Google and LinkedIn.
"It could make sense if we're talking about how GPT-3.5 used to have 4K context and we needed RAG for an arXiv paper, but we don't have to now. Back to the present: even with 10M context, we'll probably still RAG for arXiv papers from 2025 alone, and I'm not sure loading 10M worth…"
Additionally, AI models have knowledge cutoffs, meaning they cannot answer queries that depend on the latest real-time information unless that information is retrieved dynamically, which is exactly what RAG does.
Moreover, anyone planning to run Llama 4 through inference providers like Groq or Together AI will face context limits far below 10 million tokens. Groq offers approximately 130,000 tokens for both Llama 4 Scout and Maverick, while Together AI offers about 300,000 tokens for Scout and roughly 520,000 for Maverick.
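In practice, that means a request-time check: if the documents exceed the provider's window, fall back to retrieval. A rough sketch, using tiktoken as a proxy tokenizer for estimation and the approximate limits above:

```python
import tiktoken

# Approximate hosted context limits reported for Llama 4 endpoints.
PROVIDER_CONTEXT_LIMITS = {
    "groq/llama-4-scout": 130_000,
    "together/llama-4-scout": 300_000,
    "together/llama-4-maverick": 520_000,
}

enc = tiktoken.get_encoding("cl100k_base")  # proxy tokenizer for counting

def fits_in_context(docs: list[str], endpoint: str, reserve: int = 4_000) -> bool:
    """True if the docs, plus headroom for question and answer, fit the endpoint."""
    total = sum(len(enc.encode(d)) for d in docs)
    return total + reserve <= PROVIDER_CONTEXT_LIMITS[endpoint]

docs = ["example document text"] * 1_000  # stand-in for a real document set
if fits_in_context(docs, "groq/llama-4-scout"):
    context = "\n\n".join(docs)  # small enough: stuff everything into the prompt
else:
    context = "take the RAG path: retrieve top-k chunks instead"
```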
LLMs Perform Poorly Beyond 32,000 Tokens
Moreover, a study revealed that LLM performance declines well before contexts get anywhere near such lengths. Although it did not include Llama 4, the study indicated that at 32,000 tokens, 10 of the 12 tested AI models performed below half their short-context baseline. Even OpenAI's GPT-4o, one of the top performers, dropped from a baseline score of 99.3% to 69.7%.
“Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information,” read the study.
The study also noted that conflicting information within the context can confuse the AI model, making it necessary to apply a filtering step to remove irrelevant or misleading content. “That’s usually not a problem with RAG, but if we indiscriminately put everything in the context, we’ll also need a filtering step,” said D’Alia, who cited the above study to back his arguments.
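What such a filtering step might look like in code: a minimal sketch, assuming the sentence-transformers CrossEncoder API; the reranker model and score threshold are example choices, not something the study or D'Alia prescribes.

```python
from sentence_transformers import CrossEncoder

# Example reranker; for this model, scores above ~0 indicate relevance.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def filter_chunks(query: str, chunks: list[str], threshold: float = 0.0) -> list[str]:
    """Keep only the chunks the reranker judges relevant to the query."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    return [chunk for chunk, score in zip(chunks, scores) if score > threshold]

chunks = [
    "Llama 4 Scout advertises a 10M-token context window.",
    "Our cafeteria switched coffee suppliers in March.",
]
# The irrelevant cafeteria chunk is dropped before prompt construction.
print(filter_chunks("What context window does Llama 4 support?", chunks))
```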
All things considered, Meta's Llama 4 is indeed a huge step forward for open-source AI.
Artificial Analysis, a platform that evaluates AI models, said that the Llama 4 Maverick beats the Claude 3.7 Sonnet but trails the DeepSeek-V3 while being more efficient. On the other hand, the Llama 4 Scout offers performance parity with the GPT-4o mini.
On the MMLU-Pro benchmark, which evaluates LLMs on reasoning-focused questions, the Llama 4 Maverick scored 80%, matching the Claude 3.7 Sonnet (80%) and edging past OpenAI's o3-mini (79%).
On the GPQA Diamond benchmark, which tests AI models on graduate-level science questions, the Llama 4 Maverick scored 60%, matching Gemini 2.0 Flash (60%) but trailing DeepSeek V3 (66%).

Supreeth Koundinya
Supreeth is an engineering graduate who is curious about the world of artificial intelligence and loves to write stories on how it is solving problems and shaping the future of humanity.