DIUSum: Dynamic Image Utilization for Multimodal Summarization

Authors

  • Min Xiao — State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  • Junnan Zhu — State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  • Feifei Zhai — State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China; Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd, Beijing, China
  • Yu Zhou — State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China; Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd, Beijing, China
  • Chengqing Zong — State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v38i17.29899

Keywords:

NLP: Summarization, NLP: Language Grounding & Multi-modal NLP

Abstract

Existing multimodal summarization approaches focus on fusing image features during encoding, ignoring that different summaries have different needs for images. However, both intuitively and empirically, not all images improve summary quality. We therefore propose a novel Dynamic Image Utilization framework for multimodal Summarization (DIUSum) to select and utilize valuable images for summarization. First, to predict whether an image helps produce a high-quality summary, we propose an image selector that scores the usefulness of each image. Second, to dynamically utilize the multimodal information, we incorporate both hard and soft guidance from the image selector. Under this guidance, the image information is plugged into the decoder to generate a summary. Experimental results show that DIUSum outperforms multiple strong baselines and achieves state-of-the-art results on two public multimodal summarization datasets. Further analysis demonstrates that the image selector's scores reflect the degree to which images improve summary quality.
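The selector-and-guidance idea described in the abstract can be illustrated with a minimal sketch. All names and formulas below are assumptions for illustration (the paper's actual selector architecture and fusion equations are not given on this page): the selector is modeled as a sigmoid relevance score, "hard guidance" as discarding images below a threshold, and "soft guidance" as weighting image features by the score before they reach the decoder.

```python
import math

def image_selector_score(img_feat, text_feat):
    """Hypothetical image selector: sigmoid of an image-text dot product.

    Returns a usefulness score in (0, 1); higher means the image is
    predicted to help summary quality.
    """
    dot = sum(i * t for i, t in zip(img_feat, text_feat))
    return 1.0 / (1.0 + math.exp(-dot))

def fuse(text_feat, img_feat, score, threshold=0.5):
    """Combine hard and soft guidance (illustrative only).

    Hard guidance: drop images scored below the threshold entirely.
    Soft guidance: scale the surviving image features by the score
    before adding them to the text representation.
    """
    if score < threshold:  # hard guidance: unhelpful image is ignored
        return list(text_feat)
    return [t + score * i for t, i in zip(text_feat, img_feat)]

# Toy features
text = [0.2, 0.4, 0.1]
img = [0.5, 0.9, 0.3]
s = image_selector_score(img, text)
fused = fuse(text, img, s)
```

In this toy setup the fused representation would then feed the decoder; a low-scoring image leaves the text representation untouched, so the summary degrades gracefully when images are uninformative.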

Published

2024-03-24

How to Cite

Xiao, M., Zhu, J., Zhai, F., Zhou, Y., & Zong, C. (2024). DIUSum: Dynamic Image Utilization for Multimodal Summarization. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17), 19297-19305. https://doi.org/10.1609/aaai.v38i17.29899

Section

AAAI Technical Track on Natural Language Processing II