VividListener: Expressive and Controllable Listener Dynamics Modeling for Multi-Modal Responsive Interaction
DOI: https://doi.org/10.1609/aaai.v40i8.37567

Abstract
Generating responsive listener head dynamics with nuanced emotions and expressive reactions is crucial for dialogue modeling in various virtual avatar animations. Previous studies mainly focus on the direct short-term production of listener behavior, overlooking fine-grained control over motion variations and emotional intensity, especially in long-sequence modeling. Moreover, the lack of long-term, large-scale paired speaker-listener corpora incorporating head dynamics and fine-grained multi-modal annotations limits the application of dialogue modeling. Therefore, we first collect a large-scale multi-turn dataset of 3D dyadic conversation, dubbed ListenerX, containing more than 1.4M valid frames for multi-modal responsive interaction. Additionally, we propose VividListener, a novel framework enabling fine-grained, expressive, and controllable listener dynamics modeling. This framework leverages multi-modal conditions as guiding principles for fostering coherent interactions between speakers and listeners. Specifically, we design the Responsive Interaction Module (RIM) to adaptively represent the multi-modal interactive embeddings. RIM ensures that the listener dynamics achieve fine-grained semantic coordination with textual descriptions and adjustments, while preserving expressive reactions to speaker behavior. Meanwhile, we propose Emotional Intensity Tags (EIT) for emotion-intensity editing with multi-modal information integration, applied to both text descriptions and listener motion amplitude. Extensive experiments on our newly collected ListenerX dataset demonstrate that VividListener achieves state-of-the-art performance, realizing expressive and controllable listener dynamics.
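The abstract's two ideas — fusing speaker and text conditions into one interactive embedding (RIM) and scaling listener motion amplitude by an intensity tag (EIT) — can be illustrated with a minimal toy sketch. This is not the paper's implementation: the fusion (plain concatenation), the feature dimensions, and the function names are all hypothetical stand-ins chosen only to make the two roles concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_conditions(speaker_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Toy stand-in for RIM: tile a sequence-level text embedding across
    frames and concatenate it with per-frame speaker features."""
    num_frames = speaker_emb.shape[0]
    text_tiled = np.tile(text_emb, (num_frames, 1))       # (T, D_text)
    return np.concatenate([speaker_emb, text_tiled], axis=-1)

def apply_intensity(motion: np.ndarray, intensity: float) -> np.ndarray:
    """Toy stand-in for an Emotional Intensity Tag acting on motion
    amplitude: scale each frame's deviation from the mean pose."""
    mean_pose = motion.mean(axis=0, keepdims=True)        # (1, D_motion)
    return mean_pose + intensity * (motion - mean_pose)

# 8 frames of 6-DoF listener head motion, 4-dim speaker/text features
motion = rng.normal(size=(8, 6))
cond = fuse_conditions(rng.normal(size=(8, 4)), rng.normal(size=(4,)))
calm = apply_intensity(motion, 0.5)    # damped reactions
vivid = apply_intensity(motion, 1.5)   # amplified reactions
```

Here `cond` would condition a motion generator, and the intensity scalar edits how far the listener's head moves about its resting pose, so a larger tag yields visibly larger motion variance.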
Published
2026-03-14
How to Cite
Li, S., Qi, X., Yang, B., Chen, W., Tian, Z., Sun, M., … Sun, Z. (2026). VividListener: Expressive and Controllable Listener Dynamics Modeling for Multi-Modal Responsive Interaction. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6396–6404. https://doi.org/10.1609/aaai.v40i8.37567
Section
AAAI Technical Track on Computer Vision V