A Hierarchical Network for Multimodal Document-Level Relation Extraction

Authors

  • Lingxing Kong National Key Laboratory for Novel Software Technology, Nanjing University
  • Jiuliang Wang National Key Laboratory for Novel Software Technology, Nanjing University
  • Zheng Ma National Key Laboratory for Novel Software Technology, Nanjing University; Institute for AI Industry Research (AIR), Tsinghua University
  • Qifeng Zhou National Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University
  • Jianbing Zhang National Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University
  • Liang He National Key Laboratory for Novel Software Technology, Nanjing University
  • Jiajun Chen National Key Laboratory for Novel Software Technology, Nanjing University

DOI:

https://doi.org/10.1609/aaai.v38i16.29801

Keywords:

NLP: Information Extraction, NLP: Language Grounding & Multi-modal NLP

Abstract

Document-level relation extraction aims to extract entity relations that span multiple sentences. The task faces two critical issues: long dependency and mention selection. Prior works address these problems from a purely textual perspective; however, they are hard to resolve from text alone. In this paper, we leverage video information to provide additional evidence for understanding long dependencies and to offer a wider perspective for identifying relevant mentions, giving rise to a new task named Multimodal Document-level Relation Extraction (MDocRE). To tackle this new task, we construct a human-annotated dataset of documents paired with relevant videos, which, to the best of our knowledge, is the first document-level relation extraction dataset equipped with video clips. We also propose a hierarchical framework to learn interactions between different dependency levels and a textual-guided transformer architecture that incorporates both the textual and video modalities. In addition, we employ a mention gate module to address the mention-selection problem in both modalities. Experiments on our proposed dataset show that 1) incorporating video information greatly improves model performance; 2) our hierarchical framework achieves state-of-the-art results compared with both unimodal and multimodal baselines; and 3) by collaborating with video information, our model better resolves the long-dependency and mention-selection problems.
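
Code sketch (not part of the published abstract): the mention gate described above is, at a high level, a learned gate that weighs each textual mention of an entity against video evidence before aggregation. The PyTorch snippet below is only a minimal, hypothetical illustration of such a gated fusion; the class name MentionGate, the dimensions, and the fusion rule are assumptions made for exposition, not the authors' implementation.

    import torch
    import torch.nn as nn

    class MentionGate(nn.Module):
        # Hypothetical gate weighing textual mention embeddings against a video feature.
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Linear(2 * dim, dim)

        def forward(self, text_mentions, video_feat):
            # text_mentions: (num_mentions, dim); video_feat: (dim,)
            video = video_feat.unsqueeze(0).expand_as(text_mentions)
            g = torch.sigmoid(self.gate(torch.cat([text_mentions, video], dim=-1)))
            fused = g * text_mentions + (1 - g) * video   # gated text/video fusion per mention
            return fused.mean(dim=0)                      # pooled entity representation

    # Example: fuse three mention embeddings of one entity with one clip-level video feature.
    gate = MentionGate(dim=256)
    entity_repr = gate(torch.randn(3, 256), torch.randn(256))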

Published

2024-03-24

How to Cite

Kong, L., Wang, J., Ma, Z., Zhou, Q., Zhang, J., He, L., & Chen, J. (2024). A Hierarchical Network for Multimodal Document-Level Relation Extraction. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 18408-18416. https://doi.org/10.1609/aaai.v38i16.29801

Issue

Vol. 38 No. 16 (2024)

Section

AAAI Technical Track on Natural Language Processing I