Collaboratively “Copy & Paste” 2D-3D Features for Complex Video-to-Video Motion Editing
DOI: https://doi.org/10.1609/aaai.v40i16.38357
Abstract
Video-to-video human motion editing aims to transfer motion from a driving video to a reference video while preserving the background dynamics and the protagonist's original appearance. We identify critical limitations in existing methods that fail to capture the full complexity of human motions, particularly regarding: 1) location changes, 2) orientation variations, and 3) complicated non-upright poses. To address these challenges, we propose a framework that collaboratively "copies and pastes" 2D and 3D features across spatio-temporal dimensions into a shared representation space for motion guidance. Our approach achieves this through: 1) a mutual distillation mechanism that enhances the robustness and capability of individual encoders, and 2) a selective fusion module that adaptively weights and combines complementary information from spatio-temporal representations. To evaluate motion editing algorithms under challenging scenarios, we introduce a comprehensive benchmark dataset comprising real-world video clips from artistic gymnastics and figure skating competitions. These sports disciplines naturally encompass the three aforementioned aspects of motion complexity. Extensive experiments demonstrate that our approach significantly outperforms existing methods, particularly in handling intricate human motions.
Published
2026-03-14
How to Cite
Zhong, J.-X., Zhao, S., Li, J., & Zhang, L. (2026). Collaboratively “Copy & Paste” 2D-3D Features for Complex Video-to-Video Motion Editing. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13521–13529. https://doi.org/10.1609/aaai.v40i16.38357
Issue
Section
AAAI Technical Track on Computer Vision XIII