UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation


  • Zhengkun Zhang, Nankai University
  • Xiaojun Meng, Noah's Ark Lab, Huawei Technologies
  • Yasheng Wang, Noah's Ark Lab, Huawei Technologies
  • Xin Jiang, Noah's Ark Lab, Huawei Technologies
  • Qun Liu, Noah's Ark Lab, Huawei Technologies
  • Zhenglu Yang, Nankai University




Speech & Natural Language Processing (SNLP), Computer Vision (CV)


With the rapid growth of multimedia data, a large body of literature has emerged on multimodal summarization, most of which aims to distill salient information from the textual and image modalities into a pictorial summary containing the most relevant images. Existing methods mostly focus on either extractive or abstractive summarization and rely on the presence and quality of image captions to build image references. We are the first to propose a Unified framework for Multimodal Summarization grounded on BART, UniMS, that integrates extractive and abstractive objectives as well as image selection. Specifically, we adopt knowledge distillation from a vision-language pretrained model to improve image selection, which removes any dependence on the existence and quality of image captions. In addition, we introduce a visual guided decoder to better integrate the textual and visual modalities in guiding abstractive text generation. Results show that our best model achieves a new state-of-the-art result on a large-scale benchmark dataset. The newly introduced extractive objective and the knowledge distillation technique are shown to bring a noticeable improvement to the multimodal summarization task.
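The distillation idea for image selection can be illustrated with a minimal sketch. It assumes a teacher (a vision-language pretrained model) that produces one relevance score per candidate image, and a student (the summarizer) that produces its own score per image; the student is trained to match the teacher's softened distribution via a KL divergence. The function names, the temperature parameter, and the example scores are all hypothetical, and the abstract does not state the exact loss used in the paper, so this is one common formulation rather than the authors' implementation.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over the candidate images of one document.

    Assumption: both models emit one unnormalized score per image.
    No image captions are needed, only the teacher's scores.
    """
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def select_image(student_scores):
    """At inference time, pick the index of the highest-scoring image."""
    return max(range(len(student_scores)), key=lambda i: student_scores[i])


# Hypothetical scores for three candidate images of one article.
teacher = [2.1, 0.3, -1.0]   # e.g. image-text similarity from the teacher
student = [0.5, 0.4, 0.1]    # current student predictions

loss = distillation_loss(teacher, student)
best = select_image(student)
```

Minimizing this loss pulls the student's ranking of images toward the teacher's, which is how a caption-free training signal for image selection can be obtained under these assumptions.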




How to Cite

Zhang, Z., Meng, X., Wang, Y., Jiang, X., Liu, Q., & Yang, Z. (2022). UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10), 11757-11764. https://doi.org/10.1609/aaai.v36i10.21431



AAAI Technical Track on Speech and Natural Language Processing