OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding
DOI:
https://doi.org/10.1609/aaai.v40i13.38068

Abstract
In this paper, we propose OmniVDiff, a novel framework for controllable video diffusion that aims to synthesize and comprehend multiple visual modalities of video within a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or as a conditioning modality. Our framework supports three key capabilities: (1) text-conditioned video generation, where all modalities are jointly synthesized from a textual prompt; (2) video understanding, where structural modalities are predicted coherently from RGB inputs; and (3) X-conditioned video generation, where video synthesis is guided by fine-grained inputs such as depth, Canny edges, and segmentation. Extensive experiments demonstrate that OmniVDiff achieves state-of-the-art performance in video generation tasks and competitive results in video understanding. Its flexibility and scalability make it well suited for downstream applications such as video-to-video translation, modality adaptation for visual tasks, and scene reconstruction.
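The adaptive control idea in the abstract, where each modality is dynamically assigned the role of either generation target or conditioning signal within one diffusion process, can be sketched as follows. This is a minimal illustration under assumed names and a trivialized diffusion step, not the paper's implementation: the modality list, the `roles` dictionary, and `diffusion_step` are all hypothetical.

```python
import numpy as np

# Hypothetical modality set; the paper's abstract mentions depth, Canny
# edges, and segmentation alongside RGB.
MODALITIES = ["rgb", "depth", "canny", "segmentation"]

def diffusion_step(latents, roles, noise_level, rng):
    """One trivialized step of role-aware diffusion (illustrative only):
    modalities assigned the 'condition' role keep their clean latents and
    act as guidance, while 'generate' modalities receive noise as they
    would in an actual denoising-diffusion iteration."""
    out = {}
    for name, x in latents.items():
        if roles[name] == "condition":
            out[name] = x  # held fixed as the conditioning signal
        else:
            out[name] = x + noise_level * rng.standard_normal(x.shape)
    return out

rng = np.random.default_rng(0)
latents = {m: np.zeros((2, 2)) for m in MODALITIES}

# Depth-conditioned generation: depth is the conditioning modality,
# the remaining modalities are jointly generated.
roles = {m: ("condition" if m == "depth" else "generate")
         for m in MODALITIES}
stepped = diffusion_step(latents, roles, noise_level=1.0, rng=rng)
```

Switching the `roles` assignment (e.g. conditioning on RGB and generating the structural modalities) would correspond to the video-understanding capability described above, without changing the step function itself.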
Published
2026-03-14
How to Cite
Xi, D., Wang, J., Liang, Y., Qiu, X., Huo, Y., Wang, R., Zhang, C., & Li, X. (2026). OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(13), 10915-10923. https://doi.org/10.1609/aaai.v40i13.38068
Issue
Section
AAAI Technical Track on Computer Vision X