- Published on April 17, 2025
Microsoft Research has introduced BitNet b1.58 2B4T, a new 2-billion-parameter language model that uses only 1.58 bits per weight instead of the usual 16 or 32. Despite its compact size, it matches the performance of full-precision models of similar size and runs efficiently on both GPUs and CPUs.
The model was trained on a large dataset containing 4 trillion tokens and performs well across a wide range of tasks, including language understanding, math, coding, and conversation. Microsoft has released the model weights on Hugging Face, along with open-source code for running it.
In the technical report, Microsoft said that “BitNet b1.58 2B4T achieves performance on par with leading open-weight, full-precision LLMs of similar size, while offering significant advantages in computational efficiency, including substantially reduced memory footprint, energy consumption, and decoding latency.”
The model’s architecture is “derived from the standard Transformer model… incorporating significant modifications based on the BitNet framework”. The central innovation is “replacing the standard full-precision linear layers with custom BitLinear layers”, where “model weights are quantised to 1.58 bits during the forward pass”. This quantisation uses an “absolute mean (absmean) quantisation scheme, which maps weights to ternary values {-1, 0, +1}.”
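For intuition, a ternary weight can take one of three values and therefore carries log2(3) ≈ 1.58 bits of information, which is where the "1.58-bit" name comes from. The sketch below illustrates the absmean scheme described above in plain NumPy; the function name and the epsilon guard are illustrative, not Microsoft's implementation.

```python
import numpy as np

def absmean_quantize_weights(W, eps=1e-6):
    """Illustrative sketch of absmean weight quantisation to ternary {-1, 0, +1}.

    The scale is the mean absolute value of the weight matrix; weights are
    divided by it, rounded, and clipped to the ternary set.
    """
    gamma = np.abs(W).mean()                               # absmean scale
    W_ternary = np.clip(np.round(W / (gamma + eps)), -1.0, 1.0)
    return W_ternary, gamma                                # W is approximated as W_ternary * gamma
```

As the quoted description notes, this quantisation is applied to the weights during the forward pass; in the BitNet line of work the full-precision weights are retained for training and only the quantised values are used in the matrix multiplications.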
Activations are quantised to 8-bit integers with an “absolute maximum (absmax) quantisation strategy, applied per token”. SubLN normalisation is incorporated to further enhance training stability, and the feed-forward network (FFN) sub-layers employ squared ReLU (ReLU²) activation.
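A matching sketch for the activation path is shown below, assuming a symmetric int8 range of ±127 and one scale per token (row); the names and the epsilon guard are again illustrative rather than taken from the report.

```python
import numpy as np

def absmax_quantize_activations(x, eps=1e-6):
    """Illustrative per-token absmax quantisation of activations to 8-bit integers."""
    scale = 127.0 / (np.abs(x).max(axis=-1, keepdims=True) + eps)   # one scale per token
    x_int8 = np.clip(np.round(x * scale), -128, 127).astype(np.int8)
    return x_int8, scale                                            # dequantise as x_int8 / scale

def squared_relu(x):
    """Squared ReLU (ReLU^2), the activation used in the FFN sub-layers."""
    return np.square(np.maximum(x, 0.0))
```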
Rotary Position Embeddings (RoPE) are used to inject positional information. Consistent with architectures like LLaMA, all bias terms are removed from the linear layers and normalisation layers. The model uses the tokeniser developed for LLaMA 3, which implements a byte-level Byte-Pair Encoding (BPE) scheme with a vocabulary of 128,256 tokens.
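For readers unfamiliar with RoPE, a minimal sketch of the rotation applied to query and key vectors follows; the rotate-half formulation and the base of 10,000 are common conventions assumed here, not details from the report.

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Minimal sketch of Rotary Position Embeddings (RoPE).

    x has shape (seq_len, dim) with an even dim. Each pair of channels is
    rotated by an angle proportional to the token position, so relative
    positions show up directly in query-key dot products.
    """
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))    # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), inv_freq)        # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```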
The training process for BitNet b1.58 2B4T consists of three phases: pre-training, supervised fine-tuning (SFT), and direct preference optimisation (DPO).
BitNet b1.58 2B4T demonstrates that it’s possible to dramatically reduce the computational requirements of large language models without giving up performance. With its compact architecture and competitive results, it represents a meaningful step forward in making AI models more efficient and accessible.
Siddharth Jindal
Siddharth is a media graduate who loves to explore tech through journalism, putting forward ideas worth pondering in the era of artificial intelligence.