- Published on April 9, 2025

Researchers from NVIDIA, Stanford University, University of California San Diego, UC Berkeley and UT Austin have developed a new AI model that can whip up one-minute Tom and Jerry-style animation videos from just text storyboards. Picture this: dynamic, multi-scene adventures full of iconic chaos and mischief that fans love, all generated from simple written prompts.
“Today, we're releasing a new paper – One-Minute Video Generation with Test-Time Training. We add TTT layers to a pre-trained Transformer and fine-tune it to generate one-minute Tom and Jerry cartoons with strong temporal consistency. Every video below is produced directly by…” the researchers announced in a post on X.
The model, called TTT-MLP (Test-Time Training with a Multilayer Perceptron), builds on TTT layers, which enhance pre-trained transformers by allowing their hidden states to be neural networks rather than fixed-size vectors. This makes the model's memory more expressive over long horizons, which is crucial for generating coherent videos with complex narratives.
“Adding TTT layers into a pre-trained transformer enables it to generate one-minute videos from text storyboards,” researchers said.
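To make the idea concrete, here is a minimal sketch of a TTT layer whose hidden state is a two-layer MLP: at each token, the MLP's weights take one gradient step on a self-supervised loss, and the layer's output is read from the updated MLP. The projection names, loss, and learning rate below are assumptions drawn from the TTT line of work, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def mlp(h, W1, b1, W2, b2):
    # The two-layer MLP that serves as the TTT layer's hidden state.
    return F.gelu(h @ W1 + b1) @ W2 + b2

def ttt_mlp_layer(x, theta_k, theta_q, theta_v, W1, b1, W2, b2, lr=1.0):
    # x: (seq_len, d). For each token, update the MLP weights (the hidden
    # state) with one gradient step on a self-supervised loss -- mapping the
    # token's "key" view onto its "value" view -- then read the updated MLP
    # on the "query" view. Projections and loss are illustrative assumptions.
    outputs = []
    for xt in x:
        k, q, v = xt @ theta_k, xt @ theta_q, xt @ theta_v
        params = [p.detach().requires_grad_(True) for p in (W1, b1, W2, b2)]
        loss = F.mse_loss(mlp(k, *params), v)      # self-supervised target
        grads = torch.autograd.grad(loss, params)  # inner-loop gradient
        W1, b1, W2, b2 = (p - lr * g for p, g in zip(params, grads))
        outputs.append(mlp(q, W1, b1, W2, b2))     # output with updated state
    return torch.stack(outputs)

# Toy usage: random projections and weights, an 8-token sequence of width 64.
d = 64
proj = lambda: torch.randn(d, d) / d ** 0.5
W1, b1 = torch.randn(d, 4 * d) / d ** 0.5, torch.zeros(4 * d)
W2, b2 = torch.randn(4 * d, d) / (4 * d) ** 0.5, torch.zeros(d)
out = ttt_mlp_layer(torch.randn(8, d), proj(), proj(), proj(), W1, b1, W2, b2)
print(out.shape)  # torch.Size([8, 64])
```

Because the hidden state is itself a trainable network rather than a fixed-size vector, it can in principle compress and retain information across much longer contexts, which is what one-minute, multi-scene videos demand.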
Notably, the researchers curated a dataset of Tom and Jerry cartoons to fine-tune and evaluate their model.
TTT-MLP outperformed all other baselines in temporal consistency, motion smoothness, and overall aesthetics, as measured by Elo ratings computed from pairwise human evaluations. Videos generated with TTT layers beat strong baselines such as Mamba 2 and Gated DeltaNet by 34 Elo points.
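For context, Elo scoring turns pairwise human preferences into a single rating per model. Below is a minimal sketch of the standard Elo update; the K-factor and tie handling are generic choices, not details reported by the researchers.

```python
def elo_update(r_a, r_b, score_a, k=32):
    # One rating update from a single pairwise comparison.
    # score_a is 1.0 if model A's video was preferred, 0.0 if B's, 0.5 for a tie.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

# Example: two models start at 1000; model A's video wins one comparison.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```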
One of the AI-made videos shows Tom walking into an office, taking the elevator, and sitting at his desk. However, things quickly turn wild when Jerry cuts a wire, starting their usual cat-and-mouse game—but this time, in a bustling office in New York City.
While the results are promising, the researchers noted that the generated videos still contain artifacts, likely due to limitations of the pre-trained base model. They also highlighted the potential for extending the approach to longer videos and more complex stories, which they said would require significantly larger hidden states: instead of a simple two-layer MLP, the hidden state could itself be a larger neural network, possibly even a transformer.
Moreover, they added that several promising directions for future work remain, including a faster implementation. The current TTT-MLP kernel runs into performance issues from register spills and suboptimal ordering of asynchronous instructions, which the researchers believe could be addressed by reducing register pressure and making the implementation more compiler-friendly.
They also pointed out that using bidirectionality and learned gates is just one way to integrate TTT layers into a pre-trained model. Exploring better integration strategies could improve generation quality and speed up fine-tuning. They added that other types of video generation models, like autoregressive architectures, may need entirely different methods.
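As a rough illustration of the gated-residual idea (a sketch under assumed details, not the authors' implementation), a learned gate initialized to a small value lets fine-tuning start from the pre-trained model's behavior and gradually mix in the new TTT branch:

```python
import torch
import torch.nn as nn

class GatedTTTBlock(nn.Module):
    # Splice a TTT layer into a pre-trained block through a learned,
    # per-channel gate. Small initialization keeps the fine-tuned model
    # close to the original network at the start of training.
    def __init__(self, d_model, ttt_layer):
        super().__init__()
        self.ttt = ttt_layer
        self.alpha = nn.Parameter(torch.full((d_model,), 0.1))

    def forward(self, x):
        # Residual path preserves pre-trained features; tanh(alpha)
        # scales how much the new TTT branch contributes.
        return x + torch.tanh(self.alpha) * self.ttt(x)

# Stand-in usage: any module mapping (seq_len, d) -> (seq_len, d) can play
# the role of the TTT layer here.
block = GatedTTTBlock(64, nn.Linear(64, 64))
y = block(torch.randn(8, 64))
print(y.shape)  # torch.Size([8, 64])
```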
Siddharth Jindal
Siddharth is a media graduate who loves exploring tech through journalism and putting forward ideas worth pondering in the era of artificial intelligence.