Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos

Authors

  • Yue Ma, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
  • Yingqing He, The Hong Kong University of Science and Technology, Hong Kong
  • Xiaodong Cun, Tencent AI Lab, Shenzhen, China
  • Xintao Wang, Tencent AI Lab, Shenzhen, China
  • Siran Chen, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  • Xiu Li, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
  • Qifeng Chen, The Hong Kong University of Science and Technology, Hong Kong

DOI:

https://doi.org/10.1609/aaai.v38i5.28206

Keywords:

CV: Computational Photography, Image & Video Synthesis, CV: Language and Vision

Abstract

Generating text-editable and pose-controllable character videos is in high demand for creating diverse digital humans. Nevertheless, this task has been restricted by the absence of a comprehensive dataset of paired video-pose captions and of generative prior models for videos. In this work, we design a novel two-stage training scheme that uses easily obtained datasets (i.e., image-pose pairs and pose-free videos) and a pre-trained text-to-image (T2I) model to obtain pose-controllable character videos. Specifically, in the first stage, only keypoint-image pairs are used, for controllable text-to-image generation. We learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we finetune the motion of the above network on a pose-free video dataset by adding learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by our new designs, our method successfully generates continuously pose-controllable character videos while keeping the editing and concept composition ability of the pre-trained T2I model. The code and models are available at https://follow-your-pose.github.io/.
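
The two ingredients named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class and function names (`ZeroInitPoseEncoder`, `inject_pose`, `temporal_self_attention`) are hypothetical, the encoder is reduced to a single 1x1 convolution, the residual injection of pose features is an assumed design, and the reformed cross-frame attention is not shown. The sketch only shows why zero initialization leaves the pre-trained features untouched at the start of training, and how self-attention can be applied along the frame axis.

```python
import numpy as np


class ZeroInitPoseEncoder:
    """Hypothetical 1x1 convolution whose weights and bias start at zero."""

    def __init__(self, pose_channels, feat_channels):
        self.weight = np.zeros((feat_channels, pose_channels))
        self.bias = np.zeros(feat_channels)

    def __call__(self, pose):
        # pose: (frames, pose_channels, H, W) -> (frames, feat_channels, H, W)
        out = np.einsum("oc,fchw->fohw", self.weight, pose)
        return out + self.bias[None, :, None, None]


def inject_pose(features, pose, encoder):
    # Assumed residual injection: at initialization the encoder outputs
    # zeros, so the pre-trained features pass through unchanged.
    return features + encoder(pose)


def temporal_self_attention(x, wq, wk, wv):
    """Single-head self-attention over the frame axis, per spatial location.

    x: (frames, channels, H, W); wq, wk, wv: (channels, channels).
    """
    f, c, h, w = x.shape
    # Treat each spatial location as a batch item with `f` tokens.
    tokens = x.transpose(2, 3, 0, 1).reshape(h * w, f, c)
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(c)  # (H*W, F, F)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ v  # (H*W, F, C)
    return out.reshape(h, w, f, c).transpose(2, 3, 0, 1)
```

In this sketch, each pixel location attends only to the same location in other frames, which is one common way to add a temporal layer on top of a per-frame image model.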

Published

2024-03-24

How to Cite

Ma, Y., He, Y., Cun, X., Wang, X., Chen, S., Li, X., & Chen, Q. (2024). Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos. Proceedings of the AAAI Conference on Artificial Intelligence, 38(5), 4117-4125. https://doi.org/10.1609/aaai.v38i5.28206

Issue

Section

AAAI Technical Track on Computer Vision IV