Autoregressive Video Generation without Vector Quantization

arXiv 2024
Haoge Deng 1,4*,   Ting Pan 2,4*,   Haiwen Diao 3,4*,   Zhengxiong Luo 4*,   Yufeng Cui 4,   Huchuan Lu 3,   Shiguang Shan 2,   Yonggang Qi 1†,   Xinlong Wang 4†‡

1 BUPT,  2 ICT-CAS,  3 DLUT,  4 BAAI

* Equal contribution, † Corresponding author, ‡ Project leader

NOVA is a non-quantized autoregressive model for efficient and flexible visual generation.

Abstract

This paper presents a novel approach that enables autoregressive video generation with high efficiency. We propose to reformulate the video generation problem as non-quantized autoregressive modeling: temporal frame-by-frame prediction combined with spatial set-by-set prediction. Unlike raster-scan prediction in prior autoregressive models or the joint distribution modeling of fixed-length tokens in diffusion models, our approach maintains the causal property of GPT-style models for flexible in-context capabilities, while leveraging bidirectional modeling within individual frames for efficiency. With this approach, we train a video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity, i.e., 0.6B parameters. NOVA also outperforms state-of-the-art image diffusion models on text-to-image generation tasks, at a significantly lower training cost. Additionally, NOVA generalizes well to extended video durations and enables diverse zero-shot applications within one unified model.

Method

NOVA framework and the inference process. Given text inputs, NOVA performs autoregressive generation via temporal frame-by-frame prediction and spatial set-by-set prediction. Finally, we apply diffusion denoising in a continuous-value space.
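
To make this procedure concrete, below is a minimal PyTorch-style sketch of the inference loop, assuming hypothetical model and diffusion_head callables; it illustrates the frame-by-frame and set-by-set scheme described above rather than the released NOVA implementation.

        import torch

        # Hypothetical sketch: `model` and `diffusion_head` are illustrative
        # callables, not the released NOVA API.

        @torch.no_grad()
        def generate_video(model, diffusion_head, text_emb, num_frames, sets_per_frame):
            """Frames are predicted causally (frame-by-frame); tokens within a
            frame are predicted set-by-set; each set is decoded into continuous
            values by a diffusion head, so no vector quantization is needed."""
            frames = []
            context = [text_emb]                # causal context starts with the text condition
            for _ in range(num_frames):         # temporal: frame-by-frame
                frame_sets = []
                for _ in range(sets_per_frame): # spatial: set-by-set
                    # Bidirectional modeling inside the current (partial) frame,
                    # causal attention over the text and all previous frames.
                    hidden = model(context, frame_sets)
                    # Diffusion denoising in a continuous-value space: sample
                    # the next token set conditioned on the hidden state.
                    frame_sets.append(diffusion_head.sample(hidden))
                frame = torch.cat(frame_sets, dim=1)  # assemble the full frame
                frames.append(frame)
                context.append(frame)           # extend the causal context
            return torch.stack(frames, dim=1)   # (batch, frames, tokens, dim)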

Overview of our block-wise temporal and spatial generalized autoregressive attention. Unlike per-token generation, NOVA regressively predicts each frame in a causal order along the temporal dimension, and predicts each token set in a random order along the spatial dimension.
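
As a complement, the sketch below constructs the kind of block-wise mask this attention pattern implies: bidirectional within a frame, causal across frames. The name block_causal_mask is our own illustrative choice; note that NOVA additionally randomizes the order in which a frame's token sets are predicted during training, which a static mask alone does not capture.

        import torch

        def block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
            """Boolean mask of shape (T*N, T*N); True means attention is allowed."""
            n = num_frames * tokens_per_frame
            frame_id = torch.arange(n) // tokens_per_frame  # frame index per token
            # A query token may attend to any key token whose frame is not later,
            # i.e., full attention within its own frame, causal across frames.
            return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)

        # Example: 3 frames of 2 tokens each yields a 6x6 block-lower-triangular mask.
        print(block_causal_mask(3, 2).int())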

Quantitative Results

NOVA outperforms existing text-to-image models in both quality and efficiency.

NOVA rivals diffusion text-to-video models and significantly surpasses its autoregressive counterparts.

Visual Results

Text-to-image Visualization

The following text prompts guided the creation of each image, from left to right:
(1) A digital artwork of a cat styled in a whimsical fashion...
(2) A solitary lighthouse standing tall against a backdrop of stormy seas and dark, rolling clouds.
(3) A vibrant bouquet of wildflowers on a rustic wooden table.
(4) A selfie of an old man with a white beard.
(5) A serene, expansive beach with no people.
(6) A blue apple and a green cup.
(7) A chicken on the bottom of a balloon.

Text-to-video Visualization

Zero-shot video extrapolation (33 frames -> 65 frames)

Zero-shot generalization on multiple contexts

BibTeX


@article{deng2024nova,
  title={Autoregressive Video Generation without Vector Quantization},
  author={Deng, Haoge and Pan, Ting and Diao, Haiwen and Luo, Zhengxiong and Cui, Yufeng and Lu, Huchuan and Shan, Shiguang and Qi, Yonggang and Wang, Xinlong},
  journal={arXiv preprint arXiv:2412.14169},
  year={2024}
}