MMTN: Multi-Modal Memory Transformer Network for Image-Report Consistent Medical Report Generation


  • Yiming Cao Shandong University
  • Lizhen Cui Shandong University
  • Lei Zhang Shandong University
  • Fuqiang Yu Shandong University
  • Zhen Li Qilu Hospital of Shandong University
  • Yonghui Xu Shandong University



CV: Applications, CV: Medical and Biological Imaging, APP: Healthcare, Medicine & Wellness


Automatic medical report generation is an essential task in applying artificial intelligence to the medical domain: it can lighten doctors' workloads and promote clinical automation. State-of-the-art approaches employ Transformer-based encoder-decoder architectures to generate reports for medical images. However, they do not fully explore the relationships between multi-modal medical data, and they generate inaccurate and inconsistent reports. To address these issues, this paper proposes a Multi-modal Memory Transformer Network (MMTN) that handles multi-modal medical data to generate image-report consistent medical reports. On the one hand, MMTN reduces image-report inconsistencies with a dedicated encoder that associates and memorizes the relationships between medical images and medical terminology. On the other hand, MMTN exploits the cross-modal complementarity of medical vision and language for word prediction, which further improves the accuracy of the generated reports. Extensive experiments on three real datasets show that MMTN achieves significant improvements over state-of-the-art approaches on both automatic metrics and human evaluation.




How to Cite

Cao, Y., Cui, L., Zhang, L., Yu, F., Li, Z., & Xu, Y. (2023). MMTN: Multi-Modal Memory Transformer Network for Image-Report Consistent Medical Report Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1), 277-285.



AAAI Technical Track on Computer Vision I