PanoDiT: Panoramic Videos Generation with Diffusion Transformer

Authors

  • Muyang Zhang — School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  • Yuzhi Chen — School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  • Rongtao Xu — MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  • Changwei Wang — Qilu University of Technology (Shandong Academy of Sciences), Shandong, China
  • Jinming Yang — MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  • Weiliang Meng — MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  • Jianwei Guo — School of Artificial Intelligence, Beijing Normal University, Beijing, China; MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  • Huihuang Zhao — College of Computer Science and Technology, Hengyang Normal University, Hunan, China
  • Xiaopeng Zhang — MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v39i10.33089

Abstract

As immersive experiences become increasingly popular, panoramic video has garnered significant attention in both research and applications. The high cost of capturing panoramic video underscores the need for efficient prompt-based generation methods. Although recent text-to-video (T2V) diffusion techniques have shown potential in standard video generation, they face challenges when applied to panoramic videos due to substantial differences in content and motion patterns. In this paper, we propose PanoDiT, a framework that utilizes the Diffusion Transformer (DiT) architecture to generate panoramic videos from text descriptions. Unlike traditional methods that rely on UNet-based denoising, our method leverages a transformer architecture for denoising, incorporating both temporal and global attention mechanisms. This ensures coherent frame generation and smooth motion transitions, offering distinct advantages in long-horizon generation tasks. To further enhance motion and consistency in the generated videos, we introduce DTM-LoRA and two panoramic-specific losses. Compared to previous methods, PanoDiT achieves state-of-the-art performance across various evaluation metrics and a user study; the code is available in the supplementary material.

Published

2025-04-11

How to Cite

Zhang, M., Chen, Y., Xu, R., Wang, C., Yang, J., Meng, W., … Zhang, X. (2025). PanoDiT: Panoramic Videos Generation with Diffusion Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 39(10), 10040–10048. https://doi.org/10.1609/aaai.v39i10.33089

Section

AAAI Technical Track on Computer Vision IX