MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer

Authors

  • Penghui Liu, Beijing University of Technology
  • Jiangshan Wang, Tsinghua University
  • Yutong Shen, Beijing University of Technology
  • Shanhui Mo, Independent Researcher
  • Chenyang Qi, Hong Kong University of Science and Technology
  • Jack Ma, Tsinghua University

DOI:

https://doi.org/10.1609/aaai.v40i9.37660

Abstract

Multi-object video motion transfer poses significant challenges for Diffusion Transformer (DiT) architectures due to inherent motion entanglement and a lack of object-level control. We present MultiMotion, a novel unified framework that overcomes these limitations. Our core innovation is Mask-aware Attention Motion Flow (AMF), which utilizes SAM 2 masks to explicitly disentangle and control motion features for multiple objects within the DiT pipeline. Furthermore, we introduce RectPC, a high-order predictor-corrector solver for efficient and accurate sampling, particularly beneficial for multi-entity generation. To facilitate rigorous evaluation, we construct the first benchmark dataset specifically for DiT-based multi-object motion transfer. MultiMotion demonstrably achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects while maintaining DiT's high quality and scalability. Code is provided in the supplementary material.
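The mask-aware attention idea from the abstract can be illustrated with a minimal sketch. This is not the paper's actual implementation: the function `mask_aware_attention`, the single-head NumPy attention, and the toy masks are all hypothetical stand-ins. The point it demonstrates is disentanglement, i.e. tokens belonging to one object mask (which in the paper would come from SAM 2) attend only to tokens inside the same mask, so each object's motion features stay separated.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mask_aware_attention(q, k, v, masks):
    """Single-head attention restricted by per-object masks.

    q, k, v : (N, d) token features.
    masks   : (M, N) binary masks, one row per object (hypothetical stand-in
              for SAM 2 segmentation masks rasterized onto the token grid).
    Tokens i and j may interact only if some object mask covers both.
    """
    N, d = q.shape
    scores = q @ k.T / np.sqrt(d)        # (N, N) raw attention logits
    allowed = (masks.T @ masks) > 0      # (N, N) True where i, j share a mask
    scores = np.where(allowed, scores, -1e9)  # block cross-object attention
    attn = softmax(scores, axis=-1)
    return attn @ v

# Toy example: 4 tokens, two objects of two tokens each.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
masks = np.array([[1, 1, 0, 0],
                  [0, 0, 1, 1]])
out = mask_aware_attention(q, k, v, masks)
print(out.shape)  # each output token mixes values only from its own object
```

Because the attention logits outside each object's mask are set to a large negative value, perturbing the value vectors of one object leaves the other object's outputs unchanged, which is the disentanglement property the framework relies on.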

Published

2026-03-14

How to Cite

Liu, P., Wang, J., Shen, Y., Mo, S., Qi, C., & Ma, J. (2026). MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 7233–7241. https://doi.org/10.1609/aaai.v40i9.37660

Section

AAAI Technical Track on Computer Vision VI