VividListener: Expressive and Controllable Listener Dynamics Modeling for Multi-Modal Responsive Interaction

Authors

  • Shiying Li Beijing University of Posts and Telecommunications
  • Xingqun Qi The Hong Kong University of Science and Technology
  • Bingkun Yang Beijing University of Posts and Telecommunications
  • Weile Chen Zhejiang University
  • Zezhao Tian Beijing University of Posts and Telecommunications
  • Muyi Sun Beijing University of Posts and Telecommunications
  • Qifeng Liu The Hong Kong University of Science and Technology
  • Man Zhang Beijing University of Posts and Telecommunications
  • Zhenan Sun Institute of Automation, Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v40i8.37567

Abstract

Generating responsive listener head dynamics with nuanced emotions and expressive reactions is crucial for dialogue modeling in various virtual avatar animations. Previous studies mainly focus on the direct short-term production of listener behavior and overlook fine-grained control over motion variations and emotional intensity, especially in long-sequence modeling. Moreover, the lack of long-term, large-scale paired speaker-listener corpora incorporating head dynamics and fine-grained multi-modal annotations limits the application of dialogue modeling. Therefore, we first collect a new large-scale multi-turn dataset of 3D dyadic conversations containing more than 1.4M valid frames for multi-modal responsive interaction, dubbed ListenerX. Additionally, we propose VividListener, a novel framework enabling fine-grained, expressive, and controllable listener dynamics modeling. This framework leverages multi-modal conditions as guiding principles for fostering coherent interactions between speakers and listeners. Specifically, we design the Responsive Interaction Module (RIM) to adaptively represent the multi-modal interactive embeddings. RIM ensures that the listener dynamics achieve fine-grained semantic coordination with textual descriptions and adjustments, while preserving expressive reactions to speaker behavior. Meanwhile, we propose Emotional Intensity Tags (EIT) for emotion-intensity editing with multi-modal information integration, applied to both text descriptions and listener motion amplitude. Extensive experiments conducted on our newly collected ListenerX dataset demonstrate that VividListener achieves state-of-the-art performance, realizing expressive and controllable listener dynamics.

Published

2026-03-14

How to Cite

Li, S., Qi, X., Yang, B., Chen, W., Tian, Z., Sun, M., … Sun, Z. (2026). VividListener: Expressive and Controllable Listener Dynamics Modeling for Multi-Modal Responsive Interaction. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6396–6404. https://doi.org/10.1609/aaai.v40i8.37567

Issue

Vol. 40 No. 8 (2026)

Section

AAAI Technical Track on Computer Vision V