Snowflake AI’s SwiftKV Cuts Meta Llama Inference Costs by Up to 75% 

  • Published on January 17, 2025
  • In AI News

It reduces the time to first token by up to 50%, benefiting latency-sensitive applications such as chatbots and AI copilots.


Snowflake AI Research has introduced SwiftKV, an optimisation framework integrated into vLLM that significantly reduces inference costs for Meta Llama large language models (LLMs). 

The SwiftKV-optimised models, Snowflake-Llama-3.3-70B and Snowflake-Llama-3.1-405B, are available for serverless inference on Cortex AI and offer cost reductions of up to 75% compared with the baseline Meta Llama models.

“SwiftKV’s introduction comes at a critical moment for enterprises embracing LLM technologies. With the growth of use cases, organisations need solutions that deliver both immediate performance gains and long-term scalability,” the company said. 

The framework reduces computational overhead during the key-value (KV) cache generation stage by reusing hidden states from earlier transformer layers. According to Snowflake AI Research, this optimisation cuts prefill compute by up to 50% while maintaining enterprise-grade accuracy.
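The idea of reusing an earlier layer's hidden states to produce the KV caches for later layers can be sketched with a toy model. The layer count, cutoff point, and matrix shapes below are illustrative assumptions, not Snowflake's actual configuration; the sketch only shows why skipping the later transformer blocks for prompt tokens roughly halves prefill compute:

```python
import numpy as np

# Toy illustration (not Snowflake's implementation): during prefill,
# K/V projections for layers at or above a cutoff are computed from the
# cutoff layer's hidden state, so those layers' full transformer blocks
# are skipped for prompt tokens.
rng = np.random.default_rng(0)
n_layers, d_model, n_prompt = 8, 64, 16
skip_from = n_layers // 2  # hypothetical cutoff: reuse layer-4 hiddens

# Per-layer weights: a stand-in "block" matrix plus K/V projections.
blocks = [rng.standard_normal((d_model, d_model)) * 0.05 for _ in range(n_layers)]
w_k = [rng.standard_normal((d_model, d_model)) * 0.05 for _ in range(n_layers)]
w_v = [rng.standard_normal((d_model, d_model)) * 0.05 for _ in range(n_layers)]

def prefill_kv(x, use_swiftkv):
    """Build per-layer (K, V) caches for prompt x; count block evaluations."""
    kv, block_evals = [], 0
    h = x
    for layer in range(n_layers):
        # K/V for this layer, projected from the current hidden state.
        kv.append((h @ w_k[layer], h @ w_v[layer]))
        if use_swiftkv and layer >= skip_from:
            continue  # reuse the cutoff layer's hiddens; skip the block
        h = h @ blocks[layer]
        block_evals += 1
    return kv, block_evals

x = rng.standard_normal((n_prompt, d_model))
_, full_cost = prefill_kv(x, use_swiftkv=False)
_, swift_cost = prefill_kv(x, use_swiftkv=True)
print(full_cost, swift_cost)  # prints "8 4": half the blocks run at prefill
```

With the cutoff at the halfway layer, only half the transformer blocks are evaluated for prompt tokens, mirroring the up-to-50% prefill reduction the article cites; decode-time work is unchanged since every layer still has a KV cache.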

“Our approach combines model rewiring with lightweight fine-tuning and self-distillation to preserve performance,” the team explained. Accuracy loss is limited to about one point across benchmarks.
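Self-distillation of this kind generally means training the rewired (student) model to match the original (teacher) model's output distribution. A minimal sketch of such an objective, using a KL-divergence loss on a single token's logits, is shown below; the logits and function names are illustrative, not Snowflake's actual training recipe:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_kl(teacher_logits, student_logits):
    """KL(teacher || student): zero when distributions match, positive otherwise."""
    p = softmax(np.asarray(teacher_logits, dtype=float))
    q = softmax(np.asarray(student_logits, dtype=float))
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([2.0, 1.0, 0.5])
print(distill_kl(teacher, teacher))                    # prints 0.0: identical outputs
print(distill_kl(teacher, np.array([0.5, 1.0, 2.0])))  # positive: mismatch penalised
```

Minimising a loss of this shape pushes the rewired model's predictions back toward the original model's, which is how the roughly one-point accuracy loss across benchmarks is kept small.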

SwiftKV delivers performance improvements, including up to twice the throughput for models like Llama-3.3-70B on GPUs such as NVIDIA H100s. It also reduces the time to first token by up to 50%, benefiting latency-sensitive applications such as chatbots and AI copilots.

“It is designed to integrate seamlessly with vLLM, enabling additional optimisation techniques such as attention optimisation and speculative decoding,” the Snowflake team said.

Beyond its integration with Cortex AI, SwiftKV is open-source, with model checkpoints available on Hugging Face and optimised inference on vLLM. The team has also released the ArcticTraining Framework, a post-training library for building SwiftKV models, enabling enterprises and researchers to deploy custom solutions.

“By tackling computational bottlenecks, SwiftKV allows enterprises to maximise the potential of their LLM deployments,” Snowflake AI Research said.

Snowflake recently entered a multi-year deal with AI safety and research company Anthropic to use its Claude models. This partnership will make Anthropic’s Claude models available to customers through Snowflake Cortex AI and help businesses worldwide get more value from their data.

More businesses are turning to Snowflake's cloud data platform to organise their data using AI. Like Salesforce and Microsoft, Snowflake is developing AI agents with its Snowflake Intelligence platform.

Snowflake chief Sridhar Ramaswamy believes it will simplify how enterprises derive value from data. “Imagine asking a data agent, ‘Give me a summary of this Google Doc’ or ‘Tell me how many deals we had in North America last quarter’, and instantly following up with the next steps using that same agent. That’s exactly what Snowflake Intelligence will enable – a seamless way to access and act on your data in one place,” he added.


Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.
