MedMKEB: A Comprehensive Knowledge Editing Benchmark for Medical Multimodal Large Language Models

Authors

  • Dexuan Xu Peking University
  • Jieyi Wang Peking University
  • Zhongyan Chai Peking University
  • Yongzhi Cao Peking University
  • Hanpin Wang Peking University
  • Huamin Zhang Institute of Basic Theory of Chinese Medicine, China Academy of Chinese Medical Sciences
  • Yu Huang Peking University

DOI:

https://doi.org/10.1609/aaai.v40i40.40705

Abstract

Recent advances in multimodal large language models (MLLMs) have significantly improved medical AI, enabling it to unify the understanding of visual and textual information. However, as medical knowledge continues to evolve, it is critical to allow these models to efficiently update outdated or incorrect information without retraining from scratch. Although textual knowledge editing has been widely studied, there is still a lack of systematic benchmarks for multimodal medical knowledge editing involving image and text modalities. To fill this gap, we present MedMKEB, the first comprehensive benchmark designed to evaluate the reliability, generality, locality, portability, and robustness of knowledge editing in medical multimodal large language models. MedMKEB is built on a high-quality medical visual question-answering dataset and enriched with carefully constructed editing tasks, including counterfactual correction, semantic generalization, knowledge transfer, and adversarial robustness. We incorporate human expert validation to ensure the accuracy and reliability of the benchmark. Extensive experiments on state-of-the-art general and medical MLLMs demonstrate the limitations of existing knowledge editing methods in the medical domain, highlighting the need to develop specialized editing strategies.
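The abstract names five evaluation criteria without defining them here. The following is a minimal sketch of how such criteria are commonly scored in the knowledge editing literature, not the paper's own implementation. It assumes that reliability, generality, portability, and robustness are accuracies on the edited, rephrased, transfer, and adversarial examples respectively, and that locality is the fraction of unrelated probes whose answers are unchanged by the edit. The model interface (a callable taking an image identifier and a question), the exact-match scoring, and all function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model interface: (image_id, question) -> answer string.
Model = Callable[[str, str], str]
# Hypothetical example format: (image_id, question, target_answer).
Example = Tuple[str, str, str]
Probe = Tuple[str, str]  # (image_id, question), no target needed


def accuracy(model: Model, examples: List[Example]) -> float:
    """Case-insensitive exact-match accuracy; 0.0 for an empty set."""
    if not examples:
        return 0.0
    hits = sum(
        model(img, q).strip().lower() == target.strip().lower()
        for img, q, target in examples
    )
    return hits / len(examples)


def locality(pre: Model, post: Model, probes: List[Probe]) -> float:
    """Fraction of unrelated probes answered identically before and after editing."""
    if not probes:
        return 0.0
    unchanged = sum(pre(img, q) == post(img, q) for img, q in probes)
    return unchanged / len(probes)


def score_edit(
    pre: Model,
    post: Model,
    edit_set: List[Example],
    rephrased_set: List[Example],
    transfer_set: List[Example],
    adversarial_set: List[Example],
    unrelated_probes: List[Probe],
) -> Dict[str, float]:
    """Score one edit along the five criteria (assumed definitions, see lead-in)."""
    return {
        "reliability": accuracy(post, edit_set),         # the edit itself took hold
        "generality": accuracy(post, rephrased_set),     # holds under rephrasing
        "locality": locality(pre, post, unrelated_probes),  # unrelated answers intact
        "portability": accuracy(post, transfer_set),     # carries to related questions
        "robustness": accuracy(post, adversarial_set),   # survives adversarial prompts
    }
```

Exact match is the simplest scoring choice and is likely too strict for free-form medical answers; a real evaluation would substitute whatever answer-matching rule the benchmark specifies.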

Published

2026-03-14

How to Cite

Xu, D., Wang, J., Chai, Z., Cao, Y., Wang, H., Zhang, H., & Huang, Y. (2026). MedMKEB: A Comprehensive Knowledge Editing Benchmark for Medical Multimodal Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34106–34114. https://doi.org/10.1609/aaai.v40i40.40705

Section

AAAI Technical Track on Natural Language Processing V