NOVA is a non-quantized autoregressive model for efficient and flexible visual generation.
This paper presents a novel approach that enables autoregressive video generation
with high efficiency. We propose to reformulate the video generation problem as
non-quantized autoregressive modeling of temporal frame-by-frame prediction and
spatial set-by-set prediction. Unlike raster-scan prediction in prior autoregressive
models or joint distribution modeling of fixed-length tokens in diffusion models, our
approach maintains the causal property of GPT-style models for flexible in-context
capabilities, while leveraging bidirectional modeling within individual frames for
efficiency. With the proposed approach, we train a novel video autoregressive
model without vector quantization, termed NOVA. Our results demonstrate that
NOVA surpasses prior autoregressive video models in data efficiency, inference
speed, visual fidelity, and video fluency, even with a much smaller model capacity,
i.e., 0.6B parameters. NOVA also outperforms state-of-the-art image diffusion
models in text-to-image generation tasks, with a significantly lower training cost.
Additionally, NOVA generalizes well across extended video durations and enables
diverse zero-shot applications in one unified model.
NOVA framework and the inference process. Given text inputs, NOVA performs autoregressive generation via temporal frame-by-frame prediction and spatial set-by-set prediction. Finally, we apply diffusion denoising in a continuous-valued space.
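The generation order described in this caption can be illustrated with a small sketch. It is not NOVA's implementation: the function names (`random_set_schedule`, `generation_order`), the equal-sized split into sets, and the NumPy random generator are all assumptions made for illustration. The sketch only enumerates which token positions would be predicted at each step, frame by frame in time and set by set in space; the model call and the diffusion denoising step are omitted.

```python
import numpy as np

def random_set_schedule(num_tokens, num_sets, rng):
    """Split the token positions of one frame into `num_sets` sets, in random order.

    Hypothetical helper: the equal-sized split is an assumption of this sketch.
    """
    order = rng.permutation(num_tokens)
    return [np.sort(chunk) for chunk in np.array_split(order, num_sets)]

def generation_order(num_frames, tokens_per_frame, num_sets, seed=0):
    """List (frame_index, token_positions) steps in the order they would be predicted."""
    rng = np.random.default_rng(seed)
    steps = []
    for frame in range(num_frames):  # temporal: frame by frame
        for token_set in random_set_schedule(tokens_per_frame, num_sets, rng):
            steps.append((frame, token_set))  # spatial: set by set
    return steps

if __name__ == "__main__":
    for frame, token_set in generation_order(num_frames=2, tokens_per_frame=8, num_sets=4):
        print(frame, token_set.tolist())
```

Each frame is fully covered before the next frame begins, which is what keeps the temporal order causal while leaving the spatial order free.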
Overview of our block-wise temporal and spatial generalized autoregressive attention. Unlike per-token generation, NOVA autoregressively predicts each frame in causal order along the temporal dimension and predicts each token set in random order along the spatial dimension.
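One way to picture block-wise temporal attention is as a frame-level causal mask. The sketch below is an assumption-laden illustration rather than NOVA's actual attention code: it assumes every frame has the same number of tokens, ignores text tokens, and uses the hypothetical name `block_causal_mask`. A token may attend to every token in its own frame (bidirectional within a frame) and to every token in earlier frames (causal across frames).

```python
import numpy as np

def block_causal_mask(num_frames, tokens_per_frame):
    """Boolean mask where entry [q, k] is True if query token q may attend to key token k.

    Tokens are laid out frame after frame; this layout is an assumption of the sketch.
    """
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    # Allowed when the key's frame is the same as, or earlier than, the query's frame.
    return frame_id[:, None] >= frame_id[None, :]

if __name__ == "__main__":
    print(block_causal_mask(num_frames=3, tokens_per_frame=2).astype(int))
```

With one token per frame this reduces to the usual lower-triangular causal mask of GPT-style models.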
NOVA outperforms existing text-to-image models in both performance and efficiency.
NOVA rivals diffusion text-to-video models and significantly surpasses its AR counterpart.
The following text prompts guided the creation of each image, from left to right:
(1) A digital artwork of a cat styled in a whimsical fashion...
(2) A solitary lighthouse standing tall against a backdrop of stormy seas and dark, rolling clouds.
(3) A vibrant bouquet of wildflowers on a rustic wooden table.
(4) A selfie of an old man with a white beard.
(5) A serene, expansive beach with no people.
(6) A blue apple and a green cup.
(7) A chicken on the bottom of a balloon.
@article{deng2024nova,
title={Autoregressive Video Generation without Vector Quantization},
author={Deng, Haoge and Pan, Ting and Diao, Haiwen and Luo, Zhengxiong and Cui, Yufeng and Lu, Huchuan and Shan, Shiguang and Qi, Yonggang and Wang, Xinlong},
journal={arXiv preprint arXiv:2412.14169},
year={2024}
}