MMTN: Multi-Modal Memory Transformer Network for Image-Report Consistent Medical Report Generation

Yiming Cao; Lizhen Cui; Lei Zhang; Fuqiang Yu; Zhen Li; Yonghui Xu

doi:10.1609/aaai.v37i1.25100

Authors

Yiming Cao Shandong University
Lizhen Cui Shandong University
Lei Zhang Shandong University
Fuqiang Yu Shandong University
Zhen Li Qilu Hospital of Shandong University
Yonghui Xu Shandong University

DOI:

https://doi.org/10.1609/aaai.v37i1.25100

Keywords:

CV: Applications, CV: Medical and Biological Imaging, APP: Healthcare, Medicine & Wellness

Abstract

Automatic medical report generation is an essential task in applying artificial intelligence to the medical domain: it can lighten doctors' workloads and promote clinical automation. State-of-the-art approaches employ Transformer-based encoder-decoder architectures to generate reports for medical images. However, they do not fully explore the relationships between multi-modal medical data, and they generate inaccurate and inconsistent reports. To address these issues, this paper proposes a Multi-modal Memory Transformer Network (MMTN) that handles multi-modal medical data to generate image-report consistent medical reports. On the one hand, MMTN reduces image-report inconsistencies with a dedicated encoder that associates and memorizes the relationship between medical images and medical terminologies. On the other hand, MMTN exploits the cross-modal complementarity of medical vision and language for word prediction, which further improves the accuracy of the generated reports. Extensive experiments on three real datasets show that MMTN significantly outperforms state-of-the-art approaches on both automatic metrics and human evaluation.
