BIG-FUSION: Brain-Inspired Global-Local Context Fusion Framework for Multimodal Emotion Recognition in Conversations

Authors

  • Yusong Wang (Guangdong Institute of Intelligence Science and Technology; Department of Information and Communications Engineering, Tokyo Institute of Technology)
  • Xuanye Fang (School of Computer Science and Technology, Dalian University of Technology)
  • Huifeng Yin (Center for Brain Inspired Computing Research (CBICR), Department of Precision Instrument, Tsinghua University)
  • Dongyuan Li (Department of Information and Communications Engineering, Tokyo Institute of Technology)
  • Guoqi Li (Institute of Automation, Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences)
  • Qi Xu (School of Computer Science and Technology, Dalian University of Technology)
  • Yi Xu (School of Computer Science and Technology, Dalian University of Technology)
  • Shuai Zhong (Guangdong Institute of Intelligence Science and Technology)
  • Mingkun Xu (Guangdong Institute of Intelligence Science and Technology; Center for Brain Inspired Computing Research (CBICR), Department of Precision Instrument, Tsinghua University)

DOI:

https://doi.org/10.1609/aaai.v39i2.32149

Abstract

Capturing both global conversational topics and local speaker dependencies is essential for multimodal emotion recognition in conversations. Current approaches first use sequence models such as Transformers to extract global context information, then apply Graph Neural Networks to model local speaker dependencies for local context extraction, coupled with Graph Contrastive Learning (GCL) to enhance node representation learning. However, this sequential design introduces two problems: the extracted global context inevitably influences subsequent processing, compromising the independence and diversity of the original local features; and existing graph augmentation methods in GCL cannot jointly consider global and local conversational context when evaluating node importance, hindering the learning of key information. Inspired by how the human brain handles complex tasks by efficiently integrating local and global information processing mechanisms, we propose an aligned global-local context fusion framework to address these problems. The framework comprises a dual-attention Transformer and a dual-evaluation method for graph augmentation in GCL. The dual-attention Transformer combines global attention for overall context extraction with sliding-window attention for local context capture, both enhanced by spiking neuron dynamics. The dual-evaluation method comprises a global importance evaluation that identifies nodes crucial to the overall conversation context and a local importance evaluation that detects nodes significant for local semantics, generating augmented graph views that preserve both global and local information. This design ensures balanced information processing throughout the pipeline, improving biological plausibility and achieving superior emotion recognition performance.
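The dual-attention idea, global attention over the whole conversation fused with sliding-window attention over nearby utterances, can be illustrated with a minimal sketch. This is not the paper's implementation: the window size, the fusion by simple averaging, and the omission of the spiking neuron dynamics are all simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention over T utterance features.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block masked positions
    return softmax(scores, axis=-1) @ v

def dual_attention(x, window=3):
    # Global branch: every utterance attends to the full conversation.
    global_out = attention(x, x, x)
    # Local branch: a sliding-window mask restricts attention to
    # neighboring utterances, capturing local speaker context.
    T = x.shape[0]
    idx = np.arange(T)
    local_mask = np.abs(idx[:, None] - idx[None, :]) <= window // 2
    local_out = attention(x, x, x, mask=local_mask)
    # Fuse the two context streams (averaging here; the paper instead
    # integrates them with spiking neuron dynamics).
    return 0.5 * (global_out + local_out)

# Toy conversation: 6 utterances with 4-dimensional fused features.
x = np.random.default_rng(0).standard_normal((6, 4))
y = dual_attention(x)
print(y.shape)  # (6, 4)
```

The banded mask keeps each utterance's local receptive field independent of the global branch, which is the point of running the two attentions in parallel rather than sequentially.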

Published

2025-04-11

How to Cite

Wang, Y., Fang, X., Yin, H., Li, D., Li, G., Xu, Q., … Xu, M. (2025). BIG-FUSION: Brain-Inspired Global-Local Context Fusion Framework for Multimodal Emotion Recognition in Conversations. Proceedings of the AAAI Conference on Artificial Intelligence, 39(2), 1574–1582. https://doi.org/10.1609/aaai.v39i2.32149

Section

AAAI Technical Track on Cognitive Modeling & Cognitive Systems