Continuous-space video generation has advanced rapidly,
while discrete approaches lag behind due to error accumulation and long-context inconsistency.
In this work, we revisit discrete generative modeling and present
Uniform discRete diffuSion with metric
pAth (URSA),
a simple yet powerful framework that bridges the gap with continuous approaches for scalable video
generation.
At its core, URSA formulates the video generation task as an iterative
global refinement of discrete spatiotemporal tokens. It integrates two key
designs:
a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism.
These designs enable URSA to scale efficiently to high-resolution image synthesis and
long-duration video generation, while requiring significantly fewer inference steps.
Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies multiple tasks
within a single model, including interpolation and image-to-video generation.
Extensive experiments on challenging video and image generation benchmarks demonstrate that
URSA
consistently outperforms existing discrete methods and achieves performance comparable to
state-of-the-art continuous diffusion methods.
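The abstract names resolution-dependent timestep shifting without giving its formula. As a rough sketch only, the code below uses a shifting rule of the kind found in some continuous flow-matching image models, rewritten for the convention that t = 0 is noise and t = 1 is data. The function names, the square-root scaling rule, and the `base_tokens` reference value are all assumptions for illustration and may differ from what URSA actually uses.

```python
def shift_timestep(t, shift):
    """Warp t in [0, 1] (0 = noise, 1 = data).

    shift > 1 pulls t toward 0, so more sampling steps are spent at
    high noise levels. shift = 1 leaves t unchanged.
    """
    return t / (t + shift * (1.0 - t))


def resolution_shift(num_tokens, base_tokens=1024, base_shift=1.0):
    # Hypothetical rule (an assumption, not from the source): scale the
    # shift with the square root of the token-count ratio, so longer or
    # higher-resolution sequences get a larger shift.
    return base_shift * (num_tokens / base_tokens) ** 0.5


def schedule(num_steps, num_tokens):
    """Shifted timesteps from 0 to 1 for a sequence of num_tokens tokens."""
    s = resolution_shift(num_tokens)
    return [shift_timestep(i / num_steps, s) for i in range(num_steps + 1)]


low_res = schedule(num_steps=4, num_tokens=1024)   # shift = 1, uniform grid
high_res = schedule(num_steps=4, num_tokens=4096)  # shift = 2, skewed to noise
```

Under these assumptions, the endpoints 0 and 1 are preserved for any shift, and the high-resolution schedule places its interior timesteps closer to the noise end than the low-resolution one.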
Illustration of different image/video generation paradigms. Discrete-space approaches such as AR and MDM adopt non-refinable local generation, where produced tokens are fixed once generated. In contrast, URSA introduces iterative global refinement, conceptually aligning discrete methods with continuous-space approaches, and substantially narrowing their performance gap.
Global refinement via token distance in embedding space. Starting from categorical noise x0 (left), our framework refines tokens according to their distance in embedding space until it reaches the target data x1 (right), enabling hierarchical structure generation from global semantics to fine details.
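One way to picture refinement by token distance is a per-position distribution that weights each codebook token by how close its embedding is to the target token's embedding. The sketch below assumes a softmax form, p(x | x1) proportional to exp(-beta * d(x, x1)), with Euclidean distance d. The function name, the toy codebook, and the softmax form are illustrative assumptions. How beta is scheduled over time, including the linearization the abstract refers to, is not described here, so beta is left as a plain input.

```python
import math


def metric_path_probs(target, embeddings, beta):
    """Distribution over codebook tokens at one position, given the target.

    Assumed form: p(x | x1) is proportional to exp(-beta * d(x, x1)),
    where d is the Euclidean distance between token embeddings.
    beta = 0 gives uniform categorical noise; a large beta concentrates
    the mass on the target token.
    """
    tgt = embeddings[target]
    logits = [-beta * math.dist(e, tgt) for e in embeddings]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical 4-token codebook with 2-D embeddings (illustration only).
codebook = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (3.0, 3.0)]

noisy = metric_path_probs(target=0, embeddings=codebook, beta=0.0)
mid = metric_path_probs(target=0, embeddings=codebook, beta=2.0)
clean = metric_path_probs(target=0, embeddings=codebook, beta=50.0)
```

At an intermediate beta, tokens near the target in embedding space remain likely while distant ones are suppressed, which matches the coarse-to-fine behaviour the caption describes: semantically close tokens are settled first, and the exact token is resolved last.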
URSA rivals Sora-like text-to-video generation models despite using a discrete video tokenizer.
Frame-conditioned video generation emerges in URSA through accurate modeling of future motion.
URSA performs on par with state-of-the-art models in generating high-resolution images.
@article{deng2025ursa,
title={Uniform Discrete Diffusion with Metric Path for Video Generation},
author={Deng, Haoge and Pan, Ting and Zhang, Fan and Liu, Yang and Luo, Zhuoyan and Cui, Yufeng and Shen, Chunhua and Shan, Shiguang and Zhang, Zhaoxiang and Wang, Xinlong},
journal={arXiv preprint arXiv:2510.24717},
year={2025}
}