- Published on April 9, 2025

Researchers from NVIDIA, Stanford University, University of California San Diego, UC Berkeley and UT Austin have developed a new AI model that can whip up one-minute Tom and Jerry-style animation videos from just text storyboards. Picture this: dynamic, multi-scene adventures full of iconic chaos and mischief that fans love, all generated from simple written prompts.
“Today, we're releasing a new paper – One-Minute Video Generation with Test-Time Training. We add TTT layers to a pre-trained Transformer and fine-tune it to generate one-minute Tom and Jerry cartoons with strong temporal consistency. Every video below is produced directly by…” the researchers announced in a post on X.
The model, called TTT-MLP (Test-Time Training with a Multilayer Perceptron), builds on TTT layers, which enhance pre-trained transformers by allowing their hidden states to be neural networks rather than fixed-size vectors. This makes the model's memory more expressive over long horizons, which is crucial for generating coherent videos with complex narratives.
“Adding TTT layers into a pre-trained transformer enables it to generate one-minute videos from text storyboards,” researchers said.
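To make the idea concrete, here is a minimal sketch of a TTT layer whose hidden state is a two-layer MLP: at each token, the MLP's weights take one gradient step on a self-supervised loss, and the layer's output is read from the updated MLP. The projection names, loss, and learning rate below are assumptions drawn from the TTT line of work, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def mlp(h, W1, b1, W2, b2):
    # The two-layer MLP that serves as the TTT layer's hidden state.
    return F.gelu(h @ W1 + b1) @ W2 + b2

def ttt_mlp_layer(x, theta_k, theta_q, theta_v, W1, b1, W2, b2, lr=1.0):
    # x: (seq_len, d). For each token, update the MLP weights (the hidden
    # state) with one gradient step on a self-supervised loss -- mapping the
    # token's "key" view onto its "value" view -- then read the updated MLP
    # on the "query" view. Projections and loss are illustrative assumptions.
    outputs = []
    for xt in x:
        k, q, v = xt @ theta_k, xt @ theta_q, xt @ theta_v
        params = [p.detach().requires_grad_(True) for p in (W1, b1, W2, b2)]
        loss = F.mse_loss(mlp(k, *params), v)      # self-supervised target
        grads = torch.autograd.grad(loss, params)  # inner-loop gradient
        W1, b1, W2, b2 = (p - lr * g for p, g in zip(params, grads))
        outputs.append(mlp(q, W1, b1, W2, b2))     # output with updated state
    return torch.stack(outputs)

# Toy usage: random projections and weights, an 8-token sequence of width 64.
d = 64
proj = lambda: torch.randn(d, d) / d ** 0.5
W1, b1 = torch.randn(d, 4 * d) / d ** 0.5, torch.zeros(4 * d)
W2, b2 = torch.randn(4 * d, d) / (4 * d) ** 0.5, torch.zeros(d)
out = ttt_mlp_layer(torch.randn(8, d), proj(), proj(), proj(), W1, b1, W2, b2)
print(out.shape)  # torch.Size([8, 64])
```

Because the hidden state is itself a trainable network rather than a fixed-size vector, it can in principle compress and retain information across much longer contexts, which is what one-minute, multi-scene videos demand.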
Notably, the researchers curated a dataset of Tom and Jerry cartoons to fine-tune and evaluate their model.
TTT-MLP outperformed all other baselines in temporal consistency, motion smoothness, and overall aesthetics, as measured by Elo ratings computed from pairwise human evaluations. Videos generated with TTT layers beat strong baselines such as Mamba 2 and Gated DeltaNet by 34 Elo points.
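For context, Elo scoring turns pairwise human preferences into a single rating per model. Below is a minimal sketch of the standard Elo update; the K-factor and tie handling are generic choices, not details reported by the researchers.

```python
def elo_update(r_a, r_b, score_a, k=32):
    # One rating update from a single pairwise comparison.
    # score_a is 1.0 if model A's video was preferred, 0.0 if B's, 0.5 for a tie.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

# Example: two models start at 1000; model A's video wins one comparison.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```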
One of the AI-made videos shows Tom walking into an office, taking the elevator, and sitting at his desk. However, things quickly turn wild when Jerry cuts a wire, starting their usual cat-and-mouse game—but this time, in a bustling office in New York City.
While the results are promising, the researchers noted that the generated videos still contain artifacts, likely due to limitations of the pre-trained base model. They also highlighted the potential for extending the approach to longer videos and more complex stories, which they said would require significantly larger hidden states: instead of a simple two-layer MLP, the hidden state could itself be a larger neural network, possibly even a transformer.
Moreover, they added that several promising directions for future work remain, including a faster implementation. The current TTT-MLP kernel runs into performance issues from register spills and suboptimal ordering of asynchronous instructions, which the researchers believe could be addressed by reducing register pressure and making the implementation more compiler-friendly.
They also pointed out that using bidirectionality and learned gates is just one way to integrate TTT layers into a pre-trained model. Exploring better integration strategies could improve generation quality and speed up fine-tuning. They added that other types of video generation models, like autoregressive architectures, may need entirely different methods.
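As a rough illustration of the gated-residual idea (a sketch under assumed details, not the authors' implementation), a learned gate initialized to a small value lets fine-tuning start from the pre-trained model's behavior and gradually mix in the new TTT branch:

```python
import torch
import torch.nn as nn

class GatedTTTBlock(nn.Module):
    # Splice a TTT layer into a pre-trained block through a learned,
    # per-channel gate. Small initialization keeps the fine-tuned model
    # close to the original network at the start of training.
    def __init__(self, d_model, ttt_layer):
        super().__init__()
        self.ttt = ttt_layer
        self.alpha = nn.Parameter(torch.full((d_model,), 0.1))

    def forward(self, x):
        # Residual path preserves pre-trained features; tanh(alpha)
        # scales how much the new TTT branch contributes.
        return x + torch.tanh(self.alpha) * self.ttt(x)

# Stand-in usage: any module mapping (seq_len, d) -> (seq_len, d) can play
# the role of the TTT layer here.
block = GatedTTTBlock(64, nn.Linear(64, 64))
y = block(torch.randn(8, 64))
print(y.shape)  # torch.Size([8, 64])
```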
Siddharth Jindal
Siddharth is a media graduate who loves exploring tech through journalism and putting forward ideas worth pondering in the era of artificial intelligence.