Alibaba Releases Open-Source Video Generation Model Wan 2.1, Outperforms OpenAI’s Sora

  • Published on February 25, 2025
  • In AI News

The company has launched multiple models optimised for video generation, offering capabilities in text-to-video, image-to-video, video editing, text-to-image, and video-to-audio.

Chinese tech giant Alibaba has released Wan 2.1, its open-source video foundation model, along with the code and weights. The model can generate videos with complex motions that accurately simulate real-world physics.

“Wan2.1 consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks,” the company said in a blog post. 

The Wan 2.1 suite includes three main models: Wan2.1-I2V-14B, Wan2.1-T2V-14B, and Wan2.1-T2V-1.3B.

The I2V-14B model generates videos at 480P and 720P resolutions, producing complex visual scenes and motion patterns. The T2V-14B model supports similar resolutions and is, according to the company, “the only video model capable of producing both Chinese and English text.”

The T2V-1.3B model is designed for consumer-grade GPUs, requiring 8.19 GB of VRAM to generate a five-second 480P video in about four minutes on an RTX 4090 GPU.
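
For context, a minimal text-to-video call might look like the sketch below. It assumes the Hugging Face diffusers integration of Wan 2.1 (the WanPipeline class and the Wan-AI/Wan2.1-T2V-1.3B-Diffusers checkpoint id come from that later community release, not from Alibaba's announcement, so treat them as assumptions); CPU offloading is what keeps peak memory within consumer-GPU limits.

```python
# Minimal sketch: generating a short 480P clip with the 1.3B text-to-video model.
# Assumes the Hugging Face diffusers integration of Wan 2.1; the pipeline class
# and checkpoint id are from that community release, not from the article.
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # keeps peak VRAM within consumer-GPU limits

frames = pipe(
    prompt="A cat walking across a sunlit kitchen floor",
    height=480,
    width=832,      # 480P in the model's native 832x480 aspect ratio
    num_frames=81,  # roughly five seconds at 16 fps
).frames[0]

export_to_video(frames, "cat_480p.mp4", fps=16)
```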

The model outperforms OpenAI’s Sora on the VBench Leaderboard, which evaluates video generation quality across 16 dimensions, including subject consistency, motion smoothness, temporal flickering, and spatial relationships.

According to the company, the technical advancements in Wan2.1 are based on a new spatio-temporal variational autoencoder (VAE), scalable pre-training strategies, large-scale data construction, and automated evaluation metrics. 

“We propose a novel 3D causal VAE architecture specifically designed for video generation,” the company said. The model implements a feature cache mechanism, reducing memory usage and preserving temporal causality.
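
To make those two ideas concrete, the generic PyTorch sketch below (not Wan2.1’s actual code) shows a temporally causal 3D convolution, which pads only on the past side of the time axis so a frame never sees future frames, together with a feature cache that carries the trailing frames of one chunk into the next, letting a long video be encoded chunk by chunk with bounded memory.

```python
# Illustrative sketch of a causal 3D convolution with a feature cache, the two
# mechanisms the post describes. Generic PyTorch; not Wan 2.1's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_t=3, kernel_s=3):
        super().__init__()
        self.pad_t = kernel_t - 1  # pad only the past side of the time axis
        self.conv = nn.Conv3d(in_ch, out_ch, (kernel_t, kernel_s, kernel_s),
                              padding=(0, kernel_s // 2, kernel_s // 2))
        self.cache = None  # trailing frames carried over from the last chunk

    def forward(self, x, use_cache=False):
        # x: (batch, channels, time, height, width)
        if use_cache and self.cache is not None:
            x = torch.cat([self.cache, x], dim=2)       # prepend cached past
        else:
            x = F.pad(x, (0, 0, 0, 0, self.pad_t, 0))   # zero-pad the past
        if use_cache:
            self.cache = x[:, :, -self.pad_t:].detach() # save for next chunk
        return self.conv(x)

# Encoding chunk by chunk reproduces the one-shot result, because each chunk
# sees exactly the cached past frames and never any future frame.
layer = CausalConv3d(3, 8)
video = torch.randn(1, 3, 16, 64, 64)
outs = [layer(chunk, use_cache=True) for chunk in video.split(4, dim=2)]
full = torch.cat(outs, dim=2)  # (1, 8, 16, 64, 64)
```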

Performance tests indicate that Wan2.1’s VAE reconstructs video at 2.5 times the speed of HunyuanVideo on an A800 GPU. “This speed advantage will be further demonstrated at higher resolutions due to the small size design of our VAE model and the feature cache mechanism,” the company explained.

Wan2.1 employs the Flow Matching framework within the Diffusion Transformer (DiT) paradigm. It integrates the T5 encoder to process multi-language text inputs with cross-attention mechanisms. “Our experimental findings reveal a significant performance improvement with this approach at the same parameter scale,” the company said.
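
Concretely, flow matching in its common rectified-flow form trains the network to predict the constant velocity between a noise sample and a data sample along a straight interpolation path. The sketch below shows that objective in isolation; the small MLP is a stand-in for Wan2.1’s DiT backbone, and the T5 cross-attention conditioning is omitted.

```python
# Minimal sketch of one flow-matching training step (rectified-flow form).
# The tiny MLP stands in for Wan's DiT; real conditioning would come from
# T5 text embeddings via cross-attention, which is omitted here.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64 + 1, 256), nn.SiLU(), nn.Linear(256, 64))

def flow_matching_loss(x0: torch.Tensor) -> torch.Tensor:
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], 1)             # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * noise             # linear interpolation path
    target_velocity = noise - x0               # constant velocity along path
    pred = model(torch.cat([x_t, t], dim=-1))  # predict velocity at (x_t, t)
    return ((pred - target_velocity) ** 2).mean()

loss = flow_matching_loss(torch.randn(8, 64))  # one step on dummy latents
loss.backward()
```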

Wan2.1’s data pipeline involved curating and deduplicating 1.5 billion videos and 10 billion images. 
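
The post does not describe the deduplication method, but curation at this scale typically relies on near-duplicate detection. Purely as an illustration of the idea, a perceptual-hash filter over images or video keyframes could look like this:

```python
# Illustrative near-duplicate filter using perceptual hashing; a stand-in for
# the kind of deduplication such a pipeline needs, not Wan 2.1's actual method.
from PIL import Image
import imagehash

def deduplicate(paths, max_distance=4):
    """Keep one image per perceptual-hash cluster (Hamming-distance cutoff)."""
    kept, hashes = [], []
    for path in paths:
        h = imagehash.phash(Image.open(path))
        # subtracting two pHashes returns their Hamming distance
        if all(h - seen > max_distance for seen in hashes):
            kept.append(path)
            hashes.append(h)
    return kept
```

At billions of items, the pairwise scan above would give way to hash bucketing or approximate nearest-neighbour search, but the acceptance test stays the same.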

Alibaba recently released QwQ-Max-Preview, a new reasoning model in its Qwen AI family. The company plans to invest over $52 billion in cloud computing and artificial intelligence over the next three years.

Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism, putting forward ideas worth pondering in the era of artificial intelligence.
