Collaboratively “Copy & Paste” 2D-3D Features for Complex Video-to-Video Motion Editing

Jia-Xing Zhong; Shijie Zhao; Junlin Li; Li Zhang

doi:10.1609/aaai.v40i16.38357

Authors

Jia-Xing Zhong ByteDance Inc.
Shijie Zhao ByteDance Inc.
Junlin Li ByteDance Inc.
Li Zhang ByteDance Inc.

DOI:

https://doi.org/10.1609/aaai.v40i16.38357

Abstract

Video-to-video human motion editing aims to transfer motion from a driving video to a reference video while preserving the background dynamics and the protagonist's original appearance. We identify critical limitations in existing methods that fail to capture the full complexity of human motions, particularly regarding: 1) location changes, 2) orientation variations, and 3) complicated non-upright poses. To address these challenges, we propose a framework that collaboratively "copies and pastes" 2D and 3D features across spatio-temporal dimensions into a shared representation space for motion guidance. Our approach achieves this through: 1) a mutual distillation mechanism that enhances the robustness and capability of individual encoders, and 2) a selective fusion module that adaptively weights and combines complementary information from spatio-temporal representations. To evaluate motion editing algorithms under challenging scenarios, we introduce a comprehensive benchmark dataset comprising real-world video clips from artistic gymnastics and figure skating competitions. These sports disciplines naturally encompass the three aforementioned aspects of motion complexity. Extensive experiments demonstrate that our approach significantly outperforms existing methods, particularly in handling intricate human motions.
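The abstract names two components without giving their form: a mutual distillation mechanism between encoders and a selective fusion module that adaptively weights 2D and 3D information. The following is a minimal illustrative sketch of those two ideas in plain Python, not the authors' implementation. The function names, the scalar gate scores, the softmax weighting, and the mean-squared discrepancy are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
import math


def selective_fusion(feat_2d, feat_3d, score_2d, score_3d):
    """Blend two same-length feature vectors with softmax weights.

    score_2d and score_3d are hypothetical scalar relevance scores. In a
    trained model they would come from a learned gating network; here they
    are passed in directly (assumption).
    """
    if len(feat_2d) != len(feat_3d):
        raise ValueError("features must live in one shared representation space")
    # Numerically stable two-way softmax over the gate scores.
    m = max(score_2d, score_3d)
    e2, e3 = math.exp(score_2d - m), math.exp(score_3d - m)
    w2, w3 = e2 / (e2 + e3), e3 / (e2 + e3)
    fused = [w2 * a + w3 * b for a, b in zip(feat_2d, feat_3d)]
    return fused, (w2, w3)


def mutual_distillation_loss(feat_2d, feat_3d):
    """Mean-squared discrepancy between the two encoders' outputs.

    In an autodiff framework each encoder would be pulled toward a
    gradient-detached copy of the other; on plain numbers both directions
    reduce to the same value, so one term is computed (assumption).
    """
    if len(feat_2d) != len(feat_3d):
        raise ValueError("features must have equal length")
    return sum((a - b) ** 2 for a, b in zip(feat_2d, feat_3d)) / len(feat_2d)


# Toy usage with made-up two-dimensional features.
fused, weights = selective_fusion([1.0, 0.0], [0.0, 1.0], 0.0, 0.0)
loss = mutual_distillation_loss([1.0, 0.0], [0.0, 1.0])
```

With equal gate scores the two branches are weighted equally; raising one score shifts the fused vector toward that branch, which is the sense in which such a gate "adaptively weights" complementary features.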