Posted on X by Karan Dalal: Today, we're releasing a new paper – One-Minute Video Generation with Test-Time Training.
We add TTT layers to a pre-trained Transformer and fine-tune it to generate one-minute Tom and Jerry cartoons with strong temporal consistency.
Every video below is produced directly by the model […]
Research Notes: One-Minute Video Generation with Test-Time Training
Overview
The paper introduces a method for generating one-minute animated videos using Test-Time Training (TTT). By incorporating TTT layers into a pre-trained Transformer, the researchers achieve strong temporal consistency in the generated videos. The approach is demonstrated on short-form animation: after fine-tuning on domain-specific footage, the model produces Tom and Jerry cartoons directly as single one-minute videos. This addresses the challenge of maintaining coherence over long video sequences [Result 1].
Technical Analysis
The core innovation is the integration of TTT layers into a pre-trained Transformer architecture. In a TTT layer, the hidden state is itself a small model whose weights are updated by gradient steps on a self-supervised loss as the sequence is processed, including at inference time; this lets the model keep adapting its memory of the video as it generates, which improves temporal consistency over long contexts. The researchers fine-tuned the model on a dataset of short clips from the target cartoons [Result 3].
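A minimal sketch of the idea behind a TTT layer, in PyTorch. This is a simplification for illustration, not the authors' implementation: the class name, the `theta_k`/`theta_v`/`theta_q` projections, and the fixed inner learning rate `eta` are assumptions, and the inner model here is just a single linear map updated by one gradient step per token on a reconstruction loss.

```python
import torch
import torch.nn as nn

class TTTLinearSketch(nn.Module):
    """Illustrative TTT-style layer: the 'hidden state' is the weight matrix W
    of a small inner model, updated by a gradient step for every token."""

    def __init__(self, dim: int, eta: float = 0.1):
        super().__init__()
        self.theta_k = nn.Linear(dim, dim, bias=False)  # produces the input view
        self.theta_v = nn.Linear(dim, dim, bias=False)  # produces the reconstruction target
        self.theta_q = nn.Linear(dim, dim, bias=False)  # produces the query view
        self.eta = eta                                  # inner-loop learning rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        B, T, D = x.shape
        W = torch.zeros(B, D, D, device=x.device)       # inner-model weights, one per sequence
        outputs = []
        for t in range(T):
            k = self.theta_k(x[:, t])                   # (B, D)
            v = self.theta_v(x[:, t])                   # (B, D)
            q = self.theta_q(x[:, t])                   # (B, D)
            # self-supervised loss: reconstruct v from k with the inner model W
            pred = torch.bmm(W, k.unsqueeze(-1)).squeeze(-1)
            err = pred - v
            # one gradient step on 0.5 * ||W k - v||^2, whose gradient is err k^T
            W = W - self.eta * torch.bmm(err.unsqueeze(-1), k.unsqueeze(1))
            # output: apply the freshly updated inner model to the query view
            outputs.append(torch.bmm(W, q.unsqueeze(-1)).squeeze(-1))
        return torch.stack(outputs, dim=1)
```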
The method builds on recent work applying TTT layers to language models (such as LLaMA-style Transformers), extending the idea to long video sequences. Unlike traditional fine-tuning, which retrains weights offline on large datasets, a TTT layer updates its internal state incrementally during inference, making long-sequence processing more efficient and scalable [Result 2].
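As a usage note on the `TTTLinearSketch` above (still illustrative): the inner-model update is part of the forward pass, so it happens even with autograd disabled, which is the sense in which the layer adapts at inference without retraining the outer weights.

```python
# The inner-model update runs inside forward(), so it still happens under
# torch.no_grad(); none of the outer (learned) parameters are modified.
import torch

layer = TTTLinearSketch(dim=64)
frames = torch.randn(2, 128, 64)        # (batch, frame tokens, dim) -- toy input
with torch.no_grad():                    # inference: outer parameters stay frozen
    out = layer(frames)                  # inner state W still adapts token by token
print(out.shape)                         # torch.Size([2, 128, 64])
```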
The implementation follows a two-stage process: the Transformer is first pre-trained on general video data, and then the model, with TTT layers added, is fine-tuned on the target content (e.g., Tom and Jerry); the TTT hidden state continues to update during generation. This preserves the quality of the pre-trained backbone while keeping the added computational overhead modest [Result 4].
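One plausible way to graft such a layer onto an existing block during fine-tuning, again only a sketch: the wrapper below assumes a generic `pretrained_block` module and a zero-initialized gate so the network starts out identical to the pre-trained model. The paper's actual wiring may differ.

```python
import torch
import torch.nn as nn

class BlockWithTTT(nn.Module):
    """Wraps an existing pre-trained Transformer block and adds a gated TTT layer,
    so fine-tuning starts from the original behaviour and gradually learns to use
    the new layer. (Sketch, not the authors' exact architecture.)"""

    def __init__(self, pretrained_block: nn.Module, dim: int):
        super().__init__()
        self.block = pretrained_block              # frozen or lightly fine-tuned
        self.ttt = TTTLinearSketch(dim)            # new sequence-modeling layer
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: no change at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.block(x)                          # original pre-trained computation
        return x + torch.tanh(self.gate) * self.ttt(x)
```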
Implementation Details
- Model Architecture: The researchers used a Transformer-based architecture pre-trained on large-scale video data. TTT layers were added to enable dynamic adjustments during inference.
- Training Data: A combination of general video datasets and domain-specific content (e.g., cartoons) was used for fine-tuning.
- Evaluation Metrics: The model's performance was evaluated on temporal consistency, visual quality, and user preference studies on a test set [Result 5]; a rough automatic proxy for temporal consistency is sketched after this list.
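The paper's exact metrics are not reproduced here; as a rough illustration of how temporal consistency can be scored automatically, the sketch below uses average cosine similarity between consecutive frames as a stand-in (the function name and the pixel-space comparison are assumptions, not the paper's protocol).

```python
# Rough, illustrative proxy for temporal consistency (not the paper's metric):
# average cosine similarity between consecutive frames of a clip.
import torch
import torch.nn.functional as F

def temporal_consistency(frames: torch.Tensor) -> float:
    """frames: (num_frames, channels, height, width), values in [0, 1]."""
    flat = frames.flatten(start_dim=1)                      # (T, C*H*W)
    sims = F.cosine_similarity(flat[:-1], flat[1:], dim=1)  # frame t vs. frame t+1
    return sims.mean().item()

# Example: a smoothly varying clip scores higher than pure noise.
smooth = torch.linspace(0.1, 1.0, 16).view(16, 1, 1, 1).expand(16, 3, 32, 32).clone()
noise = torch.rand(16, 3, 32, 32)
print(temporal_consistency(smooth), temporal_consistency(noise))
```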
The code for this project is available on GitHub, with the implementation details and examples provided in the repository [Result 4].
Related Technologies
- Language Models: The TTT approach builds on prior work applying TTT layers to language models (e.g., LLaMA-style Transformers), where a layer's hidden state is updated by self-supervised gradient steps at inference time [Result 2].
- Diffusion Models: Diffusion models are widely used for video generation; while not directly comparable, the temporal-consistency gains reported here could complement such approaches.
- Video Generation: Earlier GAN-based (Generative Adversarial Network) video generation focused on much shorter sequences; this paper targets one-minute videos with improved stability [Result 1].
Key Takeaways
- The introduction of TTT layers significantly improves temporal consistency in video generation tasks [Result 1].
- The method demonstrates the potential of carrying techniques from language modeling (where TTT layers were first applied, e.g., with LLaMA-style models) over to video generation [Result 2].
- Because TTT layers update a compact hidden state rather than attending over the entire history, the cost of handling long sequences stays manageable, and the model can keep adjusting to the content it is generating during inference [Result 4].
Further Reading
- One-Minute Video Generation with Test-Time Training (arXiv; PDF also available via CVF Open Access)
- Project Website
- Research Paper (PDF), available via CVF Open Access
- GitHub Repository