- Published on April 17, 2025
Microsoft Research has introduced BitNet b1.58 2B4T, a new 2-billion-parameter language model that uses only 1.58 bits per weight instead of the usual 16 or 32. Despite its compact size, it matches the performance of full-precision models of similar size and runs efficiently on both GPUs and CPUs.
The model was trained on a large dataset containing 4 trillion tokens and performs well across a wide range of tasks, including language understanding, math, coding, and conversation. Microsoft has released the model weights on Hugging Face, along with open-source code for running it.
In the technical report, Microsoft said that “BitNet b1.58 2B4T achieves performance on par with leading open-weight, full-precision LLMs of similar size, while offering significant advantages in computational efficiency, including substantially reduced memory footprint, energy consumption, and decoding latency.”
The model’s architecture is “derived from the standard Transformer model… incorporating significant modifications based on the BitNet framework”. The central innovation is “replacing the standard full-precision linear layers with custom BitLinear layers”, where “model weights are quantised to 1.58 bits during the forward pass”. This quantisation uses an “absolute mean (absmean) quantisation scheme, which maps weights to ternary values {-1, 0, +1}.”
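For intuition, a ternary weight can take one of three values and therefore carries log2(3) ≈ 1.58 bits of information, which is where the "1.58-bit" name comes from. The sketch below illustrates the absmean scheme described above in plain NumPy; the function name and the epsilon guard are illustrative, not Microsoft's implementation.

```python
import numpy as np

def absmean_quantize_weights(W, eps=1e-6):
    """Illustrative sketch of absmean weight quantisation to ternary {-1, 0, +1}.

    The scale is the mean absolute value of the weight matrix; weights are
    divided by it, rounded, and clipped to the ternary set.
    """
    gamma = np.abs(W).mean()                               # absmean scale
    W_ternary = np.clip(np.round(W / (gamma + eps)), -1.0, 1.0)
    return W_ternary, gamma                                # W is approximated as W_ternary * gamma
```

As the quoted description notes, this quantisation is applied to the weights during the forward pass; in the BitNet line of work the full-precision weights are retained for training and only the quantised values are used in the matrix multiplications.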
Activations are quantised to 8-bit integers with an “absolute maximum (absmax) quantisation strategy, applied per token”. SubLN normalisation is incorporated to further enhance training stability, and the feed-forward network (FFN) sub-layers employ squared ReLU (ReLU²) activation.
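A matching sketch for the activation path is shown below, assuming a symmetric int8 range of ±127 and one scale per token (row); the names and the epsilon guard are again illustrative rather than taken from the report.

```python
import numpy as np

def absmax_quantize_activations(x, eps=1e-6):
    """Illustrative per-token absmax quantisation of activations to 8-bit integers."""
    scale = 127.0 / (np.abs(x).max(axis=-1, keepdims=True) + eps)   # one scale per token
    x_int8 = np.clip(np.round(x * scale), -128, 127).astype(np.int8)
    return x_int8, scale                                            # dequantise as x_int8 / scale

def squared_relu(x):
    """Squared ReLU (ReLU^2), the activation used in the FFN sub-layers."""
    return np.square(np.maximum(x, 0.0))
```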
Rotary Position Embeddings (RoPE) are used to inject positional information. Consistent with architectures like LLaMA, all bias terms are removed from the linear layers and normalisation layers. The model uses the tokeniser developed for LLaMA 3, which implements a byte-level Byte-Pair Encoding (BPE) scheme with a vocabulary of 128,256 tokens.
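For readers unfamiliar with RoPE, a minimal sketch of the rotation applied to query and key vectors follows; the rotate-half formulation and the base of 10,000 are common conventions assumed here, not details from the report.

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Minimal sketch of Rotary Position Embeddings (RoPE).

    x has shape (seq_len, dim) with an even dim. Each pair of channels is
    rotated by an angle proportional to the token position, so relative
    positions show up directly in query-key dot products.
    """
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))    # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), inv_freq)        # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```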
The training process for BitNet b1.58 2B4T consists of three phases: pre-training, supervised fine-tuning (SFT), and direct preference optimisation (DPO).
BitNet b1.58 2B4T demonstrates that it’s possible to dramatically reduce the computational requirements of large language models without giving up performance. With its compact architecture and competitive results, it represents a meaningful step forward in making AI models more efficient and accessible.
Siddharth Jindal
Siddharth is a media graduate who loves to explore tech through journalism, putting forward ideas worth pondering in the era of artificial intelligence.