MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models

Authors

  • Tuna Han Salih Meral, Virginia Polytechnic Institute and State University
  • Hidir Yesiltepe, Virginia Polytechnic Institute and State University
  • Connor Dunlop, Virginia Polytechnic Institute and State University
  • Pinar Yanardag, Virginia Polytechnic Institute and State University

DOI:

https://doi.org/10.1609/aaai.v40i10.37750

Abstract

Text-to-video models have demonstrated impressive capabilities in producing diverse video content, yet often lack fine-grained control over motion. We address the problem of motion transfer: given a source video and a target text prompt, generate a new video that preserves the source motion while matching the target semantics and allowing large changes in appearance and scene layout. We introduce MotionFlow, a training-free framework that performs test-time latent optimization guided by attention-derived motion cues. MotionFlow first extracts cross-attention maps from a pre-trained video diffusion model and converts them into spatio-temporal motion masks for the source subject. During generation, it optimizes the target latents so that their evolving attention patterns align with these masks, while the target text controls appearance. This avoids direct attention-map replacement and any model-specific fine-tuning, reducing artifacts and improving flexibility. Qualitative and quantitative experiments, including a user study, show that MotionFlow outperforms existing methods in motion fidelity, temporal consistency, and versatility, even under drastic scene changes.
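The abstract's pipeline (attention maps → spatio-temporal motion masks → test-time latent optimization) can be sketched in a toy form. This is a minimal illustration, not the authors' implementation: `toy_attn` is a hypothetical, differentiable stand-in for the diffusion model's cross-attention, and the threshold, loss, and optimizer settings are assumptions chosen for the sketch.

```python
import torch

def attention_to_mask(attn, thresh=0.6):
    # Normalize each frame's attention map to [0, 1], then threshold it
    # into a binary spatio-temporal motion mask for the source subject.
    a = attn - attn.amin(dim=(-2, -1), keepdim=True)
    a = a / (a.amax(dim=(-2, -1), keepdim=True) + 1e-8)
    return (a > thresh).float()

def alignment_loss(attn, mask):
    # One plausible alignment objective (an assumption, not the paper's exact
    # loss): maximize the fraction of attention mass falling inside the mask.
    inside = (attn * mask).sum(dim=(-2, -1))
    total = attn.sum(dim=(-2, -1)) + 1e-8
    return (1.0 - inside / total).mean()

def optimize_latents(latents, mask, attn_fn, steps=50, lr=0.1):
    # Test-time optimization: update the target latents so that the attention
    # they induce aligns with the source motion mask. No attention maps are
    # copied directly, and no model weights are changed.
    z = latents.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        alignment_loss(attn_fn(z), mask).backward()
        opt.step()
    return z.detach()

def toy_attn(z):
    # Hypothetical stand-in for model attention: a softmax over the spatial
    # locations of each latent frame, so it is differentiable in z.
    T, H, W = z.shape
    return torch.softmax(z.reshape(T, -1), dim=-1).reshape(T, H, W)

torch.manual_seed(0)
source_attn = toy_attn(torch.randn(4, 8, 8))        # frames x height x width
mask = attention_to_mask(source_attn)               # spatio-temporal motion mask
init_latents = torch.randn(4, 8, 8)                 # "target" latents
init_loss = alignment_loss(toy_attn(init_latents), mask)
target = optimize_latents(init_latents, mask, toy_attn)
final_loss = alignment_loss(toy_attn(target), mask)
```

After optimization, `final_loss` is lower than `init_loss`: the target latents' attention has been steered toward the source motion mask while the latents themselves (and hence appearance, in the real model) remain free to change.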

Published

2026-03-14

How to Cite

Meral, T. H. S., Yesiltepe, H., Dunlop, C., & Yanardag, P. (2026). MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8043-8051. https://doi.org/10.1609/aaai.v40i10.37750

Section

AAAI Technical Track on Computer Vision VII