UniScene-MoTion: Unified Scene & Motion-aware Diffusion Transition Framework

Authors

  • Rui Jiang, Zhejiang University
  • Chongmian Wang, Zhejiang University
  • Xinghe Fu, Zhejiang University
  • Yehao Lu, Zhejiang University
  • Teng Li, Zhejiang University
  • Xi Li, Zhejiang University

DOI:

https://doi.org/10.1609/aaai.v40i7.37458

Abstract

Video transitions are critical for ensuring temporal coherence in edited media, yet existing methods often rely on handcrafted effects or relative-scale trajectories that fail to capture the physical structure of real-world scenes. In this work, we introduce a scale-aware video transition framework that explicitly incorporates depth-aware 3D reasoning into a diffusion-based generation pipeline. Built upon a powerful image-to-video (I2V) foundation model, our method leverages single-image depth prediction to align camera motion with metric-scale geometry, enabling physically consistent transitions. To reduce reliance on precise camera inputs, we propose a bidirectional conditional control module and a progressive training strategy with conditional dropout, enhancing generalization to loosely specified or missing camera trajectories. Extensive experiments demonstrate that our approach achieves state-of-the-art performance, delivering realistic, geometrically coherent transitions across diverse scenes and applications with minimal input guidance.
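The conditional dropout mentioned in the abstract is a standard trick in conditional diffusion training: during training, the conditioning signal (here, the camera trajectory embedding) is occasionally replaced by a learned "null" embedding, so the model also learns to generate plausible transitions when camera input is loosely specified or absent. The paper's actual module is not described on this page; the sketch below is a minimal, hypothetical illustration of that general mechanism (all names, e.g. `drop_condition` and `p_drop`, are our own):

```python
import random

def drop_condition(camera_embed, null_embed, p_drop=0.1):
    """Conditional dropout for a camera-trajectory embedding (sketch).

    With probability p_drop, the true camera conditioning is swapped for
    a learned null embedding, training the model to handle missing or
    imprecise camera input. A progressive schedule could raise p_drop
    across training stages, as the abstract's strategy suggests.
    """
    if random.random() < p_drop:
        return null_embed
    return camera_embed

def progressive_p_drop(stage, stages=(0.0, 0.1, 0.3)):
    """Hypothetical stage-wise dropout schedule: later training stages
    drop the camera condition more often."""
    return stages[min(stage, len(stages) - 1)]
```

At inference time, passing the null embedding directly corresponds to generating a transition with no camera guidance at all, which is how such models handle the "missing trajectory" case.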

Published

2026-03-14

How to Cite

Jiang, R., Wang, C., Fu, X., Lu, Y., Li, T., & Li, X. (2026). UniScene-MoTion: Unified Scene & Motion-aware Diffusion Transition Framework. Proceedings of the AAAI Conference on Artificial Intelligence, 40(7), 5415–5423. https://doi.org/10.1609/aaai.v40i7.37458

Section

AAAI Technical Track on Computer Vision IV