BIG-FUSION: Brain-Inspired Global-Local Context Fusion Framework for Multimodal Emotion Recognition in Conversations

Authors

  • Yusong Wang (Guangdong Institute of Intelligence Science and Technology; Department of Information and Communications Engineering, Tokyo Institute of Technology)
  • Xuanye Fang (School of Computer Science and Technology, Dalian University of Technology)
  • Huifeng Yin (Center for Brain Inspired Computing Research (CBICR), Department of Precision Instrument, Tsinghua University)
  • Dongyuan Li (Department of Information and Communications Engineering, Tokyo Institute of Technology)
  • Guoqi Li (Institute of Automation, Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences)
  • Qi Xu (School of Computer Science and Technology, Dalian University of Technology)
  • Yi Xu (School of Computer Science and Technology, Dalian University of Technology)
  • Shuai Zhong (Guangdong Institute of Intelligence Science and Technology)
  • Mingkun Xu (Guangdong Institute of Intelligence Science and Technology; Center for Brain Inspired Computing Research (CBICR), Department of Precision Instrument, Tsinghua University)

DOI:

https://doi.org/10.1609/aaai.v39i2.32149

Abstract

Capturing both global conversational topics and local speaker dependencies is essential for multimodal emotion recognition in conversations. Current approaches first use sequence models such as Transformers to extract global context information, then apply Graph Neural Networks to model local speaker dependencies for local context extraction, coupled with Graph Contrastive Learning (GCL) to enhance node representation learning. However, this sequential design introduces two problems: the extracted global context inevitably influences subsequent processing, compromising the independence and diversity of the original local features; and existing graph augmentation methods in GCL cannot jointly consider global and local conversational context when evaluating node importance, hindering the learning of key information. Inspired by how the human brain handles complex tasks by efficiently integrating local and global information processing mechanisms, we propose an aligned global-local context fusion framework to address these problems. The framework comprises a dual-attention Transformer and a dual-evaluation method for graph augmentation in GCL. The dual-attention Transformer combines global attention for overall context extraction with sliding-window attention for local context capture, both enhanced by spiking neuron dynamics. The dual-evaluation method comprises a global importance evaluation that identifies nodes crucial to the overall conversation context and a local importance evaluation that detects nodes significant for local semantics, generating augmented graph views that preserve both global and local information. This design ensures balanced information processing throughout the pipeline, improving biological plausibility and achieving superior emotion recognition performance.
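The dual-attention idea, global attention over the whole conversation fused with sliding-window attention over nearby utterances, can be illustrated with a minimal sketch. This is not the paper's implementation: the window size, the fusion by simple averaging, and the omission of the spiking neuron dynamics are all simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention over T utterance features.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block masked positions
    return softmax(scores, axis=-1) @ v

def dual_attention(x, window=3):
    # Global branch: every utterance attends to the full conversation.
    global_out = attention(x, x, x)
    # Local branch: a sliding-window mask restricts attention to
    # neighboring utterances, capturing local speaker context.
    T = x.shape[0]
    idx = np.arange(T)
    local_mask = np.abs(idx[:, None] - idx[None, :]) <= window // 2
    local_out = attention(x, x, x, mask=local_mask)
    # Fuse the two context streams (averaging here; the paper instead
    # integrates them with spiking neuron dynamics).
    return 0.5 * (global_out + local_out)

# Toy conversation: 6 utterances with 4-dimensional fused features.
x = np.random.default_rng(0).standard_normal((6, 4))
y = dual_attention(x)
print(y.shape)  # (6, 4)
```

The banded mask keeps each utterance's local receptive field independent of the global branch, which is the point of running the two attentions in parallel rather than sequentially.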

Published

2025-04-11

How to Cite

Wang, Y., Fang, X., Yin, H., Li, D., Li, G., Xu, Q., … Xu, M. (2025). BIG-FUSION: Brain-Inspired Global-Local Context Fusion Framework for Multimodal Emotion Recognition in Conversations. Proceedings of the AAAI Conference on Artificial Intelligence, 39(2), 1574–1582. https://doi.org/10.1609/aaai.v39i2.32149

Section

AAAI Technical Track on Cognitive Modeling & Cognitive Systems