LAMS: A Location-aware Approach for Multimodal Summarization (Student Abstract)
Keywords: Multimodal, Summarization, Image Location
Abstract
Multimodal summarization aims to distill salient information from multiple modalities, of which text and images are the two most commonly studied. In recent years, many notable works in this field have modeled image-text interactions; however, they overlook the fact that most multimodal documents have been carefully organized by their writers. As a result, one critical organizational factor has long received too little attention: image locations, which may carry informative cues and indicate the key content of a document. To address this issue, we propose a location-aware approach for multimodal summarization (LAMS) based on the Transformer. We investigate image locations for multimodal summarization via a stack of multimodal fusion blocks, which can model the high-order interactions among images and texts. An extensive experimental study on an extended multimodal dataset validates the superior summarization performance of the proposed model.
How to Cite
Zhang, Z., Wang, J., Sun, Z., & Yang, Z. (2021). LAMS: A Location-aware Approach for Multimodal Summarization (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 35(18), 15949-15950. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/17971
AAAI Student Abstract and Poster Program