Emu3.5 is a natively multimodal world model that unifies vision and language through end-to-end next-token prediction on interleaved video-derived data, enhanced by reinforcement learning and DiDA-based parallel decoding for efficient, spatiotemporally consistent generation.
URSA is a simple yet powerful discrete framework that formulates video generation as an iterative process of global refinement over spatiotemporal tokens, enabling efficient scaling to long-duration videos.
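URSA's global-refinement idea can be caricatured as an iterative loop over a discrete spatiotemporal token grid: start fully masked, repeatedly re-predict, and commit a growing fraction of positions each step. This is a toy sketch under assumed notation, not URSA's actual algorithm; `predict_tokens` is a hypothetical stand-in for the learned refiner.

```python
import numpy as np

rng = np.random.default_rng(2)
VOCAB = 1024  # assumed codebook size for illustration

def predict_tokens(tokens):
    """Hypothetical stand-in for the learned refiner: proposes a full
    token grid given the current partially committed grid."""
    proposal = tokens.copy()
    masked = proposal < 0
    proposal[masked] = rng.integers(0, VOCAB, size=int(masked.sum()))
    return proposal

def generate(shape=(4, 8, 8), steps=4):
    # -1 marks MASK over a (time, height, width) token grid.
    tokens = np.full(shape, -1)
    flat_order = rng.permutation(tokens.size)
    for s in range(1, steps + 1):
        proposal = predict_tokens(tokens)
        # Commit a growing prefix of positions; the rest remain
        # refinable in later global passes.
        commit = flat_order[: tokens.size * s // steps]
        tokens.reshape(-1)[commit] = proposal.reshape(-1)[commit]
    return tokens

vid = generate()
print((vid >= 0).all())  # every spatiotemporal token is committed
```

Because each pass re-predicts the whole grid at once, the number of refinement steps is fixed regardless of video length, which is the property that makes scaling to long durations attractive.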
NOVA is a non-quantized autoregressive model that enables efficient video generation
by reformulating video generation as frame-by-frame and set-by-set predictions.
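NOVA's decoding order can be illustrated with a minimal loop: an outer temporal pass over frames and an inner pass over sets of continuous (non-quantized) embeddings within each frame. This is a toy sketch, not the actual model; `predict_next_set` is a hypothetical stand-in for its learned predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_next_set(context, set_size, dim):
    """Hypothetical stand-in for the learned predictor: returns the next
    set of continuous visual embeddings given everything generated so far."""
    return rng.standard_normal((set_size, dim))

def generate_video(num_frames=4, sets_per_frame=3, set_size=8, dim=16):
    frames = []
    context = []  # all embeddings generated so far, across frames
    for _ in range(num_frames):          # temporal: frame-by-frame
        frame_sets = []
        for _ in range(sets_per_frame):  # spatial: set-by-set within a frame
            s = predict_next_set(context, set_size, dim)
            frame_sets.append(s)
            context.append(s)
        frames.append(np.concatenate(frame_sets, axis=0))
    return np.stack(frames)  # (num_frames, sets_per_frame * set_size, dim)

video = generate_video()
print(video.shape)  # (4, 24, 16)
```

The two-level factorization is the point of the sketch: autoregression runs over frames and over sets, not over individual quantized tokens, which cuts the number of sequential prediction steps per video.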
See3D is a scalable visual-conditional multi-view diffusion (MVD) model for open-world 3D creation, which can be
trained on web-scale video collections without camera pose annotations.
GeoDream is a 3D generation method that integrates explicit generalized 3D priors with 2D
diffusion priors to obtain unambiguous, 3D-consistent geometric structures without
sacrificing diversity or fidelity.
SketchKnitter is a method that achieves vectorized sketch generation by reversing the stroke
deformation process with a diffusion model learned from real sketches, producing
higher-quality, visually appealing sketches with fewer sampling steps.
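The reversed stroke-deformation process can be caricatured as a standard DDPM-style reverse loop over a sequence of 2-D stroke points: start from a fully deformed (Gaussian) point sequence and iteratively undo the deformation. This is a toy sketch under assumed notation, not SketchKnitter's exact formulation; `denoise` is a hypothetical stand-in for the learned noise predictor.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 50  # assumed number of diffusion steps for illustration
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoise(x, t):
    """Hypothetical stand-in for the learned noise predictor; this toy
    version simply predicts zero noise."""
    return np.zeros_like(x)

def sample_sketch(num_points=32):
    # Start from a fully "deformed" point sequence: pure Gaussian noise.
    x = rng.standard_normal((num_points, 2))
    for t in reversed(range(T)):
        eps = denoise(x, t)
        # Standard DDPM posterior mean update.
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x  # (num_points, 2) stroke-point coordinates

sketch = sample_sketch()
print(sketch.shape)  # (32, 2)
```

Operating on vector stroke points rather than raster pixels is what keeps the output directly usable as a vectorized sketch.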