A Hierarchical Network for Multimodal Document-Level Relation Extraction
DOI:
https://doi.org/10.1609/aaai.v38i16.29801
Keywords:
NLP: Information Extraction, NLP: Language Grounding & Multi-modal NLP
Abstract
Document-level relation extraction aims to extract entity relations that span multiple sentences. This task faces two critical issues: long dependency and mention selection. Prior works address these problems from a textual perspective; however, they are hard to handle based solely on text information. In this paper, we leverage video information to provide additional evidence for understanding long dependencies and to offer a wider perspective for identifying relevant mentions, thus giving rise to a new task named Multimodal Document-level Relation Extraction (MDocRE). To tackle this new task, we construct a human-annotated dataset of documents and relevant videos, which, to the best of our knowledge, is the first document-level relation extraction dataset equipped with video clips. We also propose a hierarchical framework to learn interactions between different dependency levels and a textual-guided transformer architecture that incorporates both textual and video modalities. In addition, we utilize a mention gate module to address the mention-selection problem in both modalities. Experiments on our proposed dataset show that 1) incorporating video information greatly improves model performance; 2) our hierarchical framework achieves state-of-the-art results compared with both unimodal and multimodal baselines; 3) by drawing on video information, our model better solves the long-dependency and mention-selection problems.
Published
2024-03-24
How to Cite
Kong, L., Wang, J., Ma, Z., Zhou, Q., Zhang, J., He, L., & Chen, J. (2024). A Hierarchical Network for Multimodal Document-Level Relation Extraction. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 18408-18416. https://doi.org/10.1609/aaai.v38i16.29801
Section
AAAI Technical Track on Natural Language Processing I