DeepSeek Launches FlashMLA, an MLA Decoding Kernel for Hopper GPUs


Published on February 24, 2025

The kernel supports BF16 and features a paged KV cache with a block size of 64.

DeepSeek, a Chinese artificial intelligence (AI) lab backed by the hedge fund High-Flyer, has kicked off its “Open Source Week” by releasing FlashMLA, a Multi-head Latent Attention (MLA) decoding kernel designed for Nvidia's Hopper GPUs. The kernel is optimised for variable-length sequences and is already running in production.

The kernel supports BF16 and features a paged KV cache with a block size of 64. On the H800 GPU, it reaches up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound configurations.
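To illustrate what a paged KV cache with a block size of 64 means for variable-length sequences, here is a minimal Python sketch. It is not FlashMLA's code: the function names and the counter-based allocator are hypothetical, and only the block size of 64 comes from the release. Each sequence's cached tokens are stored in fixed-size blocks, and a per-sequence block table maps logical token positions to physical blocks.

```python
BLOCK_SIZE = 64  # tokens per KV-cache block, the figure reported for FlashMLA


def blocks_needed(seq_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of fixed-size blocks required to hold seq_len cached tokens."""
    return (seq_len + block_size - 1) // block_size  # integer ceiling


def build_block_table(seq_lens, block_size: int = BLOCK_SIZE):
    """Assign physical block ids to each sequence (hypothetical allocator).

    Blocks are handed out from a simple counter; a real allocator would
    recycle freed blocks instead.
    """
    table = []
    next_free = 0
    for seq_len in seq_lens:
        n = blocks_needed(seq_len, block_size)
        table.append(list(range(next_free, next_free + n)))
        next_free += n
    return table


def locate_token(table, seq_idx: int, token_pos: int, block_size: int = BLOCK_SIZE):
    """Map a logical token position to (physical block id, offset in block)."""
    return table[seq_idx][token_pos // block_size], token_pos % block_size


# Three sequences of different lengths share one pool of blocks.
seq_lens = [10, 64, 130]
block_table = build_block_table(seq_lens)
print(block_table)                         # [[0], [1], [2, 3, 4]]
print(locate_token(block_table, 2, 129))   # (4, 1)
```

Because each sequence only claims as many 64-token blocks as it needs, short and long sequences can be batched together without padding every one of them to the longest length.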

DeepSeek says FlashMLA is inspired by projects such as FlashAttention 2 and 3 and Nvidia's CUTLASS library. The kernel is available on GitHub.

“Honored to share FlashMLA – our efficient MLA decoding kernel for Hopper GPUs, optimised for variable-length sequences and now in production,” the company said in a post on X.

The release of FlashMLA is expected to improve the computational efficiency of AI inference workloads, and could also benefit compute-heavy applications in other sectors, such as cryptocurrency trading algorithms.

DeepSeek had earlier announced that it would open-source five repositories, starting this week. “We’re a tiny team (at) DeepSeek exploring AGI (Artificial General Intelligence). Starting next week, we’ll be open-sourcing five repos, sharing our small but sincere progress with full transparency,” it said on X.

Currently, it has a collection of 14 open-source models and repositories on Hugging Face.

Recently, it released its DeepSeek-R1 and DeepSeek-V3 models. These AI models offer state-of-the-art performance while being trained and deployed at a fraction of the cost of their competitors.

Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and to put forward ideas worth pondering in the era of artificial intelligence.