I am a PhD student jointly supervised by the Institute of Automation, Chinese Academy of Sciences (CASIA), and the Beijing Academy of Artificial Intelligence (BAAI), under the supervision of Prof. Zhaoxiang Zhang and Dr. Xinlong Wang. I obtained my MSc degree at BUPT in China, supervised by Prof. Yonggang Qi. I also received my Bachelor's degree in Electronic Information Science and Technology from BUPT in 2022.
My research interests include generative models, with a particular focus on multimodal generation.
Emu3.5 is a natively multimodal world model that unifies vision and language through end-to-end next-token prediction on interleaved video-derived data, enhanced by reinforcement learning and DiDA-based parallel decoding for efficient, spatiotemporally consistent generation.
URSA is a simple yet powerful discrete framework that formulates video generation as an iterative process of global refinement over spatiotemporal tokens, enabling efficient scaling to long-duration videos.
NOVA is a non-quantized autoregressive model that enables efficient video generation by reformulating video creation as frame-by-frame and set-by-set predictions.
See3D is a scalable visual-conditional multi-view diffusion (MVD) model for open-world 3D creation, which can be trained on web-scale video collections without camera pose annotations.
GeoDream is a 3D generation method that integrates explicit generalized 3D priors with 2D
diffusion priors to enhance the capability of obtaining unambiguous 3D consistent geometric
structures without sacrificing diversity or fidelity.
SketchKnitter is a method that achieves vectorized sketch generation by reversing the stroke
deformation process using a diffusion model learned from real sketches, enabling the creation of
higher quality, visually appealing sketches with fewer sampling steps.