OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding

Authors

  • Dianbing Xi, State Key Laboratory of CAD&CG, Zhejiang University; Institute of Artificial Intelligence, China Telecom
  • Jiepeng Wang, Institute of Artificial Intelligence, China Telecom
  • Yuanzhi Liang, Institute of Artificial Intelligence, China Telecom
  • Xi Qiu, Institute of Artificial Intelligence, China Telecom
  • Yuchi Huo, State Key Laboratory of CAD&CG, Zhejiang University
  • Rui Wang, State Key Laboratory of CAD&CG, Zhejiang University
  • Chi Zhang, Institute of Artificial Intelligence, China Telecom
  • Xuelong Li, Institute of Artificial Intelligence, China Telecom

DOI:

https://doi.org/10.1609/aaai.v40i13.38068

Abstract

In this paper, we propose OmniVDiff, a novel framework for controllable video diffusion that aims to synthesize and comprehend multiple video visual modalities in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. Our framework supports three key capabilities: (1) Text-conditioned video generation, where all modalities are jointly synthesized from a textual prompt; (2) Video understanding, where structural modalities are predicted from RGB inputs in a coherent manner; and (3) X-conditioned video generation, where video synthesis is guided by fine-grained inputs such as depth maps, Canny edges, and segmentation masks. Extensive experiments demonstrate that OmniVDiff achieves state-of-the-art performance in video generation tasks and competitive results in video understanding. Its flexibility and scalability make it well-suited for downstream applications such as video-to-video translation, modality adaptation for visual tasks, and scene reconstruction.
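The adaptive control strategy described above can be pictured as a per-modality role switch applied at each diffusion step: modalities flagged as conditions pass through clean to guide denoising, while the rest are noised and jointly generated. The sketch below is illustrative only; the function and parameter names are assumptions, not the paper's actual API.

```python
import random

def adaptive_control_step(modalities, conditioning, noise_level):
    """Illustrative sketch of an adaptive control strategy.

    `modalities` maps modality names (e.g. "rgb", "depth") to flat lists
    of pixel values. Modalities listed in `conditioning` are kept clean,
    acting as guidance; all others receive Gaussian noise and play the
    role of generation modalities for this step.
    """
    out = {}
    for name, frames in modalities.items():
        if name in conditioning:
            out[name] = frames  # conditioning modality: pass through unchanged
        else:
            # generation modality: perturb with Gaussian noise
            out[name] = [x + random.gauss(0.0, noise_level) for x in frames]
    return out
```

With `conditioning={"depth"}` the call sketches X-conditioned generation (depth guides RGB synthesis); with `conditioning={"rgb"}` it sketches video understanding (structural modalities predicted from RGB); with an empty set, all modalities are generated jointly from text.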

Published

2026-03-14

How to Cite

Xi, D., Wang, J., Liang, Y., Qiu, X., Huo, Y., Wang, R., Zhang, C., & Li, X. (2026). OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(13), 10915-10923. https://doi.org/10.1609/aaai.v40i13.38068

Section

AAAI Technical Track on Computer Vision X