Snowflake AI’s SwiftKV Cuts Meta Llama Inference Costs by Up to 75% 

  • Published on January 17, 2025
  • In AI News

It reduces the time to first token by up to 50%, benefiting latency-sensitive applications such as chatbots and AI copilots.


Snowflake AI Research has introduced SwiftKV, an optimisation framework integrated into vLLM that significantly reduces inference costs for Meta Llama large language models (LLMs). 

The SwiftKV-optimised models, Snowflake-Llama-3.3-70B and Snowflake-Llama-3.1-405B, are available for serverless inference on Cortex AI and offer cost reductions of up to 75% compared with the baseline Meta Llama models.

“SwiftKV’s introduction comes at a critical moment for enterprises embracing LLM technologies. With the growth of use cases, organisations need solutions that deliver both immediate performance gains and long-term scalability,” the company said. 

The framework reduces computational overhead during the key-value (KV) cache generation stage by reusing hidden states from earlier transformer layers. According to Snowflake AI Research, this optimisation cuts prefill compute by up to 50% while maintaining enterprise-grade accuracy.
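The idea of reusing an earlier layer's hidden states to produce the KV caches for later layers can be sketched with a toy model. The layer count, cutoff point, and matrix shapes below are illustrative assumptions, not Snowflake's actual configuration; the sketch only shows why skipping the later transformer blocks for prompt tokens roughly halves prefill compute:

```python
import numpy as np

# Toy illustration (not Snowflake's implementation): during prefill,
# K/V projections for layers at or above a cutoff are computed from the
# cutoff layer's hidden state, so those layers' full transformer blocks
# are skipped for prompt tokens.
rng = np.random.default_rng(0)
n_layers, d_model, n_prompt = 8, 64, 16
skip_from = n_layers // 2  # hypothetical cutoff: reuse layer-4 hiddens

# Per-layer weights: a stand-in "block" matrix plus K/V projections.
blocks = [rng.standard_normal((d_model, d_model)) * 0.05 for _ in range(n_layers)]
w_k = [rng.standard_normal((d_model, d_model)) * 0.05 for _ in range(n_layers)]
w_v = [rng.standard_normal((d_model, d_model)) * 0.05 for _ in range(n_layers)]

def prefill_kv(x, use_swiftkv):
    """Build per-layer (K, V) caches for prompt x; count block evaluations."""
    kv, block_evals = [], 0
    h = x
    for layer in range(n_layers):
        # K/V for this layer, projected from the current hidden state.
        kv.append((h @ w_k[layer], h @ w_v[layer]))
        if use_swiftkv and layer >= skip_from:
            continue  # reuse the cutoff layer's hiddens; skip the block
        h = h @ blocks[layer]
        block_evals += 1
    return kv, block_evals

x = rng.standard_normal((n_prompt, d_model))
_, full_cost = prefill_kv(x, use_swiftkv=False)
_, swift_cost = prefill_kv(x, use_swiftkv=True)
print(full_cost, swift_cost)  # prints "8 4": half the blocks run at prefill
```

With the cutoff at the halfway layer, only half the transformer blocks are evaluated for prompt tokens, mirroring the up-to-50% prefill reduction the article cites; decode-time work is unchanged since every layer still has a KV cache.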

“Our approach combines model rewiring with lightweight fine-tuning and self-distillation to preserve performance,” the team explained. Accuracy loss is limited to about one point across benchmarks.
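Self-distillation of this kind generally means training the rewired (student) model to match the original (teacher) model's output distribution. A minimal sketch of such an objective, using a KL-divergence loss on a single token's logits, is shown below; the logits and function names are illustrative, not Snowflake's actual training recipe:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_kl(teacher_logits, student_logits):
    """KL(teacher || student): zero when distributions match, positive otherwise."""
    p = softmax(np.asarray(teacher_logits, dtype=float))
    q = softmax(np.asarray(student_logits, dtype=float))
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([2.0, 1.0, 0.5])
print(distill_kl(teacher, teacher))                    # prints 0.0: identical outputs
print(distill_kl(teacher, np.array([0.5, 1.0, 2.0])))  # positive: mismatch penalised
```

Minimising a loss of this shape pushes the rewired model's predictions back toward the original model's, which is how the roughly one-point accuracy loss across benchmarks is kept small.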

SwiftKV delivers performance improvements, including up to twice the throughput for models like Llama-3.3-70B on GPUs such as NVIDIA H100s. It also reduces the time to first token by up to 50%, benefiting latency-sensitive applications such as chatbots and AI copilots.

“It is designed to integrate seamlessly with vLLM, enabling additional optimisation techniques such as attention optimisation and speculative decoding,” the Snowflake team said.

Beyond its integration with Cortex AI, SwiftKV is open-source, with model checkpoints available on Hugging Face and optimised inference on vLLM. The team has also released the ArcticTraining Framework, a post-training library for building SwiftKV models, enabling enterprises and researchers to deploy custom solutions.

“By tackling computational bottlenecks, SwiftKV allows enterprises to maximise the potential of their LLM deployments,” Snowflake AI Research said.

Snowflake recently entered a multi-year deal with AI safety and research company Anthropic to use its Claude models. This partnership will make Anthropic’s Claude models available to customers through Snowflake Cortex AI and help businesses worldwide get more value from their data.

More businesses are turning to Snowflake's cloud data platform to organise their data using AI. Like Salesforce and Microsoft, Snowflake is developing AI agents with its Snowflake Intelligence platform.

Snowflake chief Sridhar Ramaswamy believes it will simplify how enterprises derive value from data. “Imagine asking a data agent, ‘Give me a summary of this Google Doc’ or ‘Tell me how many deals we had in North America last quarter’, and instantly following up with the next steps using that same agent. That’s exactly what Snowflake Intelligence will enable – a seamless way to access and act on your data in one place,” he added.


Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.
