LAMS: A Location-aware Approach for Multimodal Summarization (Student Abstract)

Zhengkun Zhang; Jun Wang; Zhe Sun; Zhenglu Yang

doi:10.1609/aaai.v35i18.17971

Authors

Zhengkun Zhang, Nankai University
Jun Wang, Ludong University
Zhe Sun, RIKEN
Zhenglu Yang, Nankai University

Keywords:

Multimodal, Summarization, Image Location

Abstract

Multimodal summarization aims to distill salient information from multiple modalities, of which text and images are the two most widely discussed. In recent years, many notable works have emerged in this field by modeling image-text interactions; however, they neglect the fact that most multimodal documents have been carefully organized by their writers. As a result, one critical organizational factor has long received insufficient attention: image locations, which may carry illuminating information and indicate the key content of a document. To address this issue, we propose a location-aware approach for multimodal summarization (LAMS) based on the Transformer. We investigate image locations for multimodal summarization via a stack of multimodal fusion blocks, which can model high-order interactions among images and texts. An extensive experimental study on an extended multimodal dataset validates the superior summarization performance of the proposed model.
