InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

Authors

  • Yuchi Wang — Peking University
  • Junliang Guo — Microsoft Research Asia
  • Jianhong Bai — Zhejiang University
  • Runyi Yu — Hong Kong University of Science and Technology
  • Tianyu He — Microsoft Research Asia
  • Xu Tan — Microsoft Research Asia
  • Xu Sun — Peking University
  • Jiang Bian — Microsoft Research Asia

DOI:

https://doi.org/10.1609/aaai.v39i8.32877

Abstract

Recent talking-avatar generation models have made strides toward realistic, accurate lip synchronization with the audio, but they often fall short in controlling and conveying the avatar's detailed expressions and emotions, making the generated video less vivid and controllable. In this paper, we propose a text-guided approach for generating emotionally expressive 2D avatars that offers fine-grained control, improved interactivity, and better generalization in the resulting video. Our framework, named InstructAvatar, leverages a natural-language interface to control both the emotion and the facial motion of avatars. Technically, we use GPT-4V to design an automatic annotation pipeline that constructs an instruction-video paired training dataset, and combine it with a novel two-branch diffusion-based generator that predicts avatars conditioned on audio and text instructions simultaneously. Experimental results demonstrate that InstructAvatar aligns well with both conditions and outperforms existing methods in fine-grained emotion control, lip-sync quality, and naturalness.
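The abstract's "two-branch" conditioning, where one branch consumes audio (for lip sync) and the other consumes the text instruction (for emotion and motion), could be sketched as two parallel cross-attention paths inside a diffusion denoiser. The sketch below is a minimal illustration of that idea only; the module names, dimensions, and additive fusion are assumptions for clarity, not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class TwoBranchDenoiser(nn.Module):
    """Toy denoiser with separate audio and text-instruction branches.

    Hypothetical sketch: each branch is a cross-attention layer from the
    noisy video latent to one condition stream; the branches are fused
    additively before predicting the noise residual.
    """

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.out = nn.Linear(dim, dim)  # predicts the noise estimate

    def forward(self, noisy_latent, audio_emb, text_emb):
        # Branch 1: attend to audio features (drives lip synchronization).
        a, _ = self.audio_attn(noisy_latent, audio_emb, audio_emb)
        # Branch 2: attend to text-instruction features (drives emotion/motion).
        t, _ = self.text_attn(noisy_latent, text_emb, text_emb)
        # Fuse both branches and predict the noise residual.
        h = self.norm(noisy_latent + a + t)
        return self.out(h)


# Usage: batch of 2, 16 latent tokens, 10 audio frames, 5 instruction tokens.
model = TwoBranchDenoiser()
eps_hat = model(torch.randn(2, 16, 64),
                torch.randn(2, 10, 64),
                torch.randn(2, 5, 64))
print(eps_hat.shape)  # torch.Size([2, 16, 64])
```

Keeping the two condition streams in separate attention branches, rather than concatenating them, would let each modality be weighted (or dropped, for classifier-free guidance) independently at sampling time.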

Published

2025-04-11

How to Cite

Wang, Y., Guo, J., Bai, J., Yu, R., He, T., Tan, X., … Bian, J. (2025). InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 8132–8140. https://doi.org/10.1609/aaai.v39i8.32877

Section

AAAI Technical Track on Computer Vision VII