A Hierarchical Network for Multimodal Document-Level Relation Extraction

Authors

  • Lingxing Kong National Key Laboratory for Novel Software Technology, Nanjing University
  • Jiuliang Wang National Key Laboratory for Novel Software Technology, Nanjing University
  • Zheng Ma National Key Laboratory for Novel Software Technology, Nanjing University; Institute for AI Industry Research (AIR), Tsinghua University
  • Qifeng Zhou National Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University
  • Jianbing Zhang National Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University
  • Liang He National Key Laboratory for Novel Software Technology, Nanjing University
  • Jiajun Chen National Key Laboratory for Novel Software Technology, Nanjing University

DOI:

https://doi.org/10.1609/aaai.v38i16.29801

Keywords:

NLP: Information Extraction, NLP: Language Grounding & Multi-modal NLP

Abstract

Document-level relation extraction aims to extract entity relations that span multiple sentences. The task faces two critical issues: long dependency and mention selection. Prior works address these problems from a purely textual perspective; however, they are hard to resolve from text alone. In this paper, we leverage video information to provide additional evidence for understanding long dependencies and to offer a wider perspective for identifying relevant mentions, giving rise to a new task named Multimodal Document-level Relation Extraction (MDocRE). To tackle this new task, we construct a human-annotated dataset of documents paired with relevant videos, which, to the best of our knowledge, is the first document-level relation extraction dataset equipped with video clips. We also propose a hierarchical framework to learn interactions between different dependency levels and a textual-guided transformer architecture that incorporates both the textual and video modalities. In addition, we employ a mention gate module to address the mention-selection problem in both modalities. Experiments on our proposed dataset show that 1) incorporating video information greatly improves model performance; 2) our hierarchical framework achieves state-of-the-art results compared with both unimodal and multimodal baselines; and 3) by collaborating with video information, our model better resolves the long-dependency and mention-selection problems.
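
Code sketch (not part of the published abstract): the mention gate described above is, at a high level, a learned gate that weighs each textual mention of an entity against video evidence before aggregation. The PyTorch snippet below is only a minimal, hypothetical illustration of such a gated fusion; the class name MentionGate, the dimensions, and the fusion rule are assumptions made for exposition, not the authors' implementation.

    import torch
    import torch.nn as nn

    class MentionGate(nn.Module):
        # Hypothetical gate weighing textual mention embeddings against a video feature.
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Linear(2 * dim, dim)

        def forward(self, text_mentions, video_feat):
            # text_mentions: (num_mentions, dim); video_feat: (dim,)
            video = video_feat.unsqueeze(0).expand_as(text_mentions)
            g = torch.sigmoid(self.gate(torch.cat([text_mentions, video], dim=-1)))
            fused = g * text_mentions + (1 - g) * video   # gated text/video fusion per mention
            return fused.mean(dim=0)                      # pooled entity representation

    # Example: fuse three mention embeddings of one entity with one clip-level video feature.
    gate = MentionGate(dim=256)
    entity_repr = gate(torch.randn(3, 256), torch.randn(256))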

Published

2024-03-24

How to Cite

Kong, L., Wang, J., Ma, Z., Zhou, Q., Zhang, J., He, L., & Chen, J. (2024). A Hierarchical Network for Multimodal Document-Level Relation Extraction. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 18408-18416. https://doi.org/10.1609/aaai.v38i16.29801

Issue

Vol. 38 No. 16 (2024)

Section

AAAI Technical Track on Natural Language Processing I