- Published on January 17, 2025
Snowflake AI Research has introduced SwiftKV, an optimisation framework integrated into vLLM that significantly reduces inference costs for Meta Llama large language models (LLMs).
The SwiftKV-optimised models, Snowflake-Llama-3.3-70B and Snowflake-Llama-3.1-405B, are available for serverless inference on Cortex AI. They offer cost reductions of up to 75% compared to the baseline Meta Llama models without SwiftKV.
“SwiftKV’s introduction comes at a critical moment for enterprises embracing LLM technologies. With the growth of use cases, organisations need solutions that deliver both immediate performance gains and long-term scalability,” the company said.
The framework reduces computational overhead during the key-value (KV) cache generation stage by reusing hidden states from earlier transformer layers. According to Snowflake AI Research, this optimisation cuts prefill compute by up to 50% while maintaining enterprise-grade accuracy.
“Our approach combines model rewiring with lightweight fine-tuning and self-distillation to preserve performance,” the team explained. Accuracy loss is limited to about one point across benchmarks.
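The mechanism described above can be illustrated with a minimal numpy sketch: during prefill, the hidden state from an earlier layer is reused to project the K/V cache entries for all later layers, so those layers' full computation is skipped for prompt tokens. The weights, layer structure, and skip point below are hypothetical stand-ins for illustration, not Snowflake's implementation.

```python
import numpy as np

# Toy stand-in for a transformer stack: one matmul + tanh per "block",
# plus per-layer K/V projections. All shapes and weights are illustrative.
rng = np.random.default_rng(0)
d, n_layers, seq_len = 16, 8, 4
W_layer = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]
W_k = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]
W_v = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]

def prefill_kv(x, skip_from):
    """Build a K/V cache for every layer while running only the first
    `skip_from` blocks on the prompt tokens (the SwiftKV-style reuse)."""
    h, kv_cache, blocks_run = x, [], 0
    for layer in range(n_layers):
        # K/V is projected from the current hidden state for every layer...
        kv_cache.append((h @ W_k[layer], h @ W_v[layer]))
        if layer < skip_from:
            # ...but only the early blocks actually run during prefill.
            h = np.tanh(h @ W_layer[layer])
            blocks_run += 1
        # Layers >= skip_from reuse the last computed hidden state: no block compute.
    return kv_cache, blocks_run

x = rng.normal(size=(seq_len, d))
kv_cache, blocks_run = prefill_kv(x, skip_from=n_layers // 2)
# A K/V entry exists for all 8 layers, but only 4 blocks ran during prefill.
print(len(kv_cache), blocks_run)  # 8 4
```

Skipping half the blocks for prompt tokens is what halves prefill compute in this toy; the lightweight fine-tuning and self-distillation the team mentions would then adjust the model so the reused hidden states still produce accurate K/V for the later layers.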
SwiftKV delivers performance improvements, including up to twice the throughput for models like Llama-3.3-70B on GPUs such as the NVIDIA H100. It also reduces the time to the first token by up to 50%, benefiting latency-sensitive applications such as chatbots and AI copilots.
“It is designed to integrate seamlessly with vLLM, enabling additional optimisation techniques such as attention optimisation and speculative decoding,” the Snowflake team said.
Beyond its integration with Cortex AI, SwiftKV is open-source, with model checkpoints available on Hugging Face and optimised inference on vLLM. The team has also released the ArcticTraining Framework, a post-training library for building SwiftKV models, enabling enterprises and researchers to deploy custom solutions.
“By tackling computational bottlenecks, SwiftKV allows enterprises to maximise the potential of their LLM deployments,” Snowflake AI Research said.
Snowflake recently entered a multi-year deal with AI safety and research company Anthropic, making Anthropic's Claude models available to customers through Snowflake Cortex AI and helping businesses worldwide get more value from their data.
More businesses are turning to Snowflake's cloud data platform to organise their data using AI. Like Salesforce and Microsoft, Snowflake is developing AI agents with its Snowflake Intelligence platform.
Snowflake chief Sridhar Ramaswamy believes it will simplify how enterprises derive value from data. “Imagine asking a data agent, ‘Give me a summary of this Google Doc’ or ‘Tell me how many deals we had in North America last quarter’, and instantly following up with the next steps using that same agent. That’s exactly what Snowflake Intelligence will enable – a seamless way to access and act on your data in one place,” he added.
Siddharth Jindal
Siddharth is a media graduate who loves exploring tech through journalism and putting forward ideas worth pondering in the era of artificial intelligence.