InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

Authors

  • Yuchi Wang — Peking University
  • Junliang Guo — Microsoft Research Asia
  • Jianhong Bai — Zhejiang University
  • Runyi Yu — Hong Kong University of Science and Technology
  • Tianyu He — Microsoft Research Asia
  • Xu Tan — Microsoft Research Asia
  • Xu Sun — Peking University
  • Jiang Bian — Microsoft Research Asia

DOI:

https://doi.org/10.1609/aaai.v39i8.32877

Abstract

Recent talking-avatar generation models have made strides toward realistic, accurate lip synchronization with the audio, but they often fall short in controlling and conveying the avatar's detailed expressions and emotions, making the generated video less vivid and controllable. In this paper, we propose a text-guided approach for generating emotionally expressive 2D avatars that offers fine-grained control, improved interactivity, and better generalization in the resulting video. Our framework, named InstructAvatar, leverages a natural-language interface to control both the emotion and the facial motion of avatars. Technically, we use GPT-4V to design an automatic annotation pipeline that constructs an instruction-video paired training dataset, and combine it with a novel two-branch diffusion-based generator that predicts avatars conditioned on audio and text instructions simultaneously. Experimental results demonstrate that InstructAvatar aligns well with both conditions and outperforms existing methods in fine-grained emotion control, lip-sync quality, and naturalness.
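The abstract's "two-branch" conditioning, where one branch consumes audio (for lip sync) and the other consumes the text instruction (for emotion and motion), could be sketched as two parallel cross-attention paths inside a diffusion denoiser. The sketch below is a minimal illustration of that idea only; the module names, dimensions, and additive fusion are assumptions for clarity, not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class TwoBranchDenoiser(nn.Module):
    """Toy denoiser with separate audio and text-instruction branches.

    Hypothetical sketch: each branch is a cross-attention layer from the
    noisy video latent to one condition stream; the branches are fused
    additively before predicting the noise residual.
    """

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.out = nn.Linear(dim, dim)  # predicts the noise estimate

    def forward(self, noisy_latent, audio_emb, text_emb):
        # Branch 1: attend to audio features (drives lip synchronization).
        a, _ = self.audio_attn(noisy_latent, audio_emb, audio_emb)
        # Branch 2: attend to text-instruction features (drives emotion/motion).
        t, _ = self.text_attn(noisy_latent, text_emb, text_emb)
        # Fuse both branches and predict the noise residual.
        h = self.norm(noisy_latent + a + t)
        return self.out(h)


# Usage: batch of 2, 16 latent tokens, 10 audio frames, 5 instruction tokens.
model = TwoBranchDenoiser()
eps_hat = model(torch.randn(2, 16, 64),
                torch.randn(2, 10, 64),
                torch.randn(2, 5, 64))
print(eps_hat.shape)  # torch.Size([2, 16, 64])
```

Keeping the two condition streams in separate attention branches, rather than concatenating them, would let each modality be weighted (or dropped, for classifier-free guidance) independently at sampling time.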

Published

2025-04-11

How to Cite

Wang, Y., Guo, J., Bai, J., Yu, R., He, T., Tan, X., … Bian, J. (2025). InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 8132–8140. https://doi.org/10.1609/aaai.v39i8.32877

Section

AAAI Technical Track on Computer Vision VII