Alibaba Releases Open-Source Video Generation Model Wan 2.1, Outperforms OpenAI’s Sora

  • Published on February 25, 2025
  • In AI News

The company has launched multiple models optimised for video generation, offering capabilities in text-to-video, image-to-video, video editing, text-to-image, and video-to-audio.

Chinese tech giant Alibaba has released Wan 2.1, its open-source video foundation model, along with the code and weights. The model can generate videos with complex motions that accurately simulate real-world physics.

“Wan2.1 consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks,” the company said in a blog post. 

The Wan 2.1 suite includes three main models: Wan2.1-I2V-14B, Wan2.1-T2V-14B, and Wan2.1-T2V-1.3B.

The I2V-14B model generates videos at 480P and 720P resolutions, producing complex visual scenes and motion patterns. The T2V-14B model supports similar resolutions and is, according to the company, “the only video model capable of producing both Chinese and English text.”

The T2V-1.3B model is designed for consumer-grade GPUs, requiring 8.19 GB of VRAM to generate a five-second 480P video in about four minutes on an RTX 4090 GPU.
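
For context, a minimal text-to-video call might look like the sketch below. It assumes the Hugging Face diffusers integration of Wan 2.1 (the WanPipeline class and the Wan-AI/Wan2.1-T2V-1.3B-Diffusers checkpoint id come from that later community release, not from Alibaba's announcement, so treat them as assumptions); CPU offloading is what keeps peak memory within consumer-GPU limits.

```python
# Minimal sketch: generating a short 480P clip with the 1.3B text-to-video model.
# Assumes the Hugging Face diffusers integration of Wan 2.1; the pipeline class
# and checkpoint id are from that community release, not from the article.
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # keeps peak VRAM within consumer-GPU limits

frames = pipe(
    prompt="A cat walking across a sunlit kitchen floor",
    height=480,
    width=832,      # 480P in the model's native 832x480 aspect ratio
    num_frames=81,  # roughly five seconds at 16 fps
).frames[0]

export_to_video(frames, "cat_480p.mp4", fps=16)
```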

The model outperforms OpenAI’s Sora on the VBench Leaderboard, which evaluates video generation quality across 16 dimensions, including subject consistency, motion smoothness, temporal flickering, and spatial relationships.

According to the company, the technical advancements in Wan2.1 are based on a new spatio-temporal variational autoencoder (VAE), scalable pre-training strategies, large-scale data construction, and automated evaluation metrics. 

“We propose a novel 3D causal VAE architecture specifically designed for video generation,” the company said. The model implements a feature cache mechanism, reducing memory usage and preserving temporal causality.
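
To make those two ideas concrete, the generic PyTorch sketch below (not Wan2.1’s actual code) shows a temporally causal 3D convolution, which pads only on the past side of the time axis so a frame never sees future frames, together with a feature cache that carries the trailing frames of one chunk into the next, letting a long video be encoded chunk by chunk with bounded memory.

```python
# Illustrative sketch of a causal 3D convolution with a feature cache, the two
# mechanisms the post describes. Generic PyTorch; not Wan 2.1's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_t=3, kernel_s=3):
        super().__init__()
        self.pad_t = kernel_t - 1  # pad only the past side of the time axis
        self.conv = nn.Conv3d(in_ch, out_ch, (kernel_t, kernel_s, kernel_s),
                              padding=(0, kernel_s // 2, kernel_s // 2))
        self.cache = None  # trailing frames carried over from the last chunk

    def forward(self, x, use_cache=False):
        # x: (batch, channels, time, height, width)
        if use_cache and self.cache is not None:
            x = torch.cat([self.cache, x], dim=2)       # prepend cached past
        else:
            x = F.pad(x, (0, 0, 0, 0, self.pad_t, 0))   # zero-pad the past
        if use_cache:
            self.cache = x[:, :, -self.pad_t:].detach() # save for next chunk
        return self.conv(x)

# Encoding chunk by chunk reproduces the one-shot result, because each chunk
# sees exactly the cached past frames and never any future frame.
layer = CausalConv3d(3, 8)
video = torch.randn(1, 3, 16, 64, 64)
outs = [layer(chunk, use_cache=True) for chunk in video.split(4, dim=2)]
full = torch.cat(outs, dim=2)  # (1, 8, 16, 64, 64)
```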

Performance tests indicate that Wan2.1’s VAE reconstructs video at 2.5 times the speed of HunyuanVideo on an A800 GPU. “This speed advantage will be further demonstrated at higher resolutions due to the small size design of our VAE model and the feature cache mechanism,” the company explained.

Wan2.1 employs the Flow Matching framework within the Diffusion Transformer (DiT) paradigm. It integrates the T5 encoder to process multi-language text inputs with cross-attention mechanisms. “Our experimental findings reveal a significant performance improvement with this approach at the same parameter scale,” the company said.
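
Concretely, flow matching in its common rectified-flow form trains the network to predict the constant velocity between a noise sample and a data sample along a straight interpolation path. The sketch below shows that objective in isolation; the small MLP is a stand-in for Wan2.1’s DiT backbone, and the T5 cross-attention conditioning is omitted.

```python
# Minimal sketch of one flow-matching training step (rectified-flow form).
# The tiny MLP stands in for Wan's DiT; real conditioning would come from
# T5 text embeddings via cross-attention, which is omitted here.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64 + 1, 256), nn.SiLU(), nn.Linear(256, 64))

def flow_matching_loss(x0: torch.Tensor) -> torch.Tensor:
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], 1)             # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * noise             # linear interpolation path
    target_velocity = noise - x0               # constant velocity along path
    pred = model(torch.cat([x_t, t], dim=-1))  # predict velocity at (x_t, t)
    return ((pred - target_velocity) ** 2).mean()

loss = flow_matching_loss(torch.randn(8, 64))  # one step on dummy latents
loss.backward()
```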

Wan2.1’s data pipeline involved curating and deduplicating 1.5 billion videos and 10 billion images. 
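
The post does not describe the deduplication method, but curation at this scale typically relies on near-duplicate detection. Purely as an illustration of the idea, a perceptual-hash filter over images or video keyframes could look like this:

```python
# Illustrative near-duplicate filter using perceptual hashing; a stand-in for
# the kind of deduplication such a pipeline needs, not Wan 2.1's actual method.
from PIL import Image
import imagehash

def deduplicate(paths, max_distance=4):
    """Keep one image per perceptual-hash cluster (Hamming-distance cutoff)."""
    kept, hashes = [], []
    for path in paths:
        h = imagehash.phash(Image.open(path))
        # subtracting two pHashes returns their Hamming distance
        if all(h - seen > max_distance for seen in hashes):
            kept.append(path)
            hashes.append(h)
    return kept
```

At billions of items, the pairwise scan above would give way to hash bucketing or approximate nearest-neighbour search, but the acceptance test stays the same.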

Alibaba recently released QwQ-Max-Preview, a new reasoning model in its Qwen AI family. The company plans to invest over $52 billion in cloud computing and artificial intelligence over the next three years.

Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism, putting forward ideas worth pondering in the era of artificial intelligence.
