Collaboratively “Copy & Paste” 2D-3D Features for Complex Video-to-Video Motion Editing

Authors

  • Jia-Xing Zhong, ByteDance Inc.
  • Shijie Zhao, ByteDance Inc.
  • Junlin Li, ByteDance Inc.
  • Li Zhang, ByteDance Inc.

DOI:

https://doi.org/10.1609/aaai.v40i16.38357

Abstract

Video-to-video human motion editing aims to transfer motion from a driving video to a reference video while preserving the background dynamics and the protagonist's original appearance. We identify critical limitations in existing methods that fail to capture the full complexity of human motions, particularly regarding: 1) location changes, 2) orientation variations, and 3) complicated non-upright poses. To address these challenges, we propose a framework that collaboratively "copies and pastes" 2D and 3D features across spatio-temporal dimensions into a shared representation space for motion guidance. Our approach achieves this through: 1) a mutual distillation mechanism that enhances the robustness and capability of individual encoders, and 2) a selective fusion module that adaptively weights and combines complementary information from spatio-temporal representations. To evaluate motion editing algorithms under challenging scenarios, we introduce a comprehensive benchmark dataset comprising real-world video clips from artistic gymnastics and figure skating competitions. These sports disciplines naturally encompass the three aforementioned aspects of motion complexity. Extensive experiments demonstrate that our approach significantly outperforms existing methods, particularly in handling intricate human motions.
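The abstract names two components, mutual distillation between encoders and a selective fusion module, without giving their formulation. The following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation: features are plain Python lists of equal length, the gate scores are given as inputs rather than predicted by a network, and the function names `selective_fusion` and `mutual_distillation_loss` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def selective_fusion(feat_2d, feat_3d, gate_2d, gate_3d):
    """Blend a 2D and a 3D feature vector with softmax-normalized weights.

    Assumption: gate_2d and gate_3d are scalar scores that a learned
    module would normally predict; here they are passed in directly.
    """
    w_2d, w_3d = softmax([gate_2d, gate_3d])
    return [w_2d * a + w_3d * b for a, b in zip(feat_2d, feat_3d)]


def mutual_distillation_loss(feat_2d, feat_3d):
    """Mean squared gap between the two embeddings in the shared space.

    Assumption: in a trainable setup each encoder would be pulled toward
    a gradient-detached copy of the other; without autograd the two
    directions reduce to this single symmetric value.
    """
    return sum((a - b) ** 2 for a, b in zip(feat_2d, feat_3d)) / len(feat_2d)


if __name__ == "__main__":
    f2d, f3d = [1.0, 0.0], [0.0, 1.0]
    print(selective_fusion(f2d, f3d, 0.0, 0.0))   # equal gates: even blend
    print(mutual_distillation_loss(f2d, f3d))     # disagreement between branches
```

With equal gate scores the fused vector is the average of the two inputs; a much larger score on one branch makes the output follow that branch almost entirely, which is the sense in which such a module "selects" between complementary representations.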

Published

2026-03-14

How to Cite

Zhong, J.-X., Zhao, S., Li, J., & Zhang, L. (2026). Collaboratively “Copy & Paste” 2D-3D Features for Complex Video-to-Video Motion Editing. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13521–13529. https://doi.org/10.1609/aaai.v40i16.38357

Issue

Vol. 40 No. 16 (2026)

Section

AAAI Technical Track on Computer Vision XIII