Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs

Authors

  • David Restrepo, Massachusetts Institute of Technology
  • Chenwei Wu, University of Michigan - Ann Arbor
  • Zhengxu Tang, University of Michigan - Ann Arbor
  • Zitao Shuai, University of Michigan - Ann Arbor
  • Thao Nguyen Minh Phan, National Yang Ming Chiao Tung University
  • Jun-En Ding, Stevens Institute of Technology
  • Cong-Tinh Dao, National Yang Ming Chiao Tung University
  • Jack Gallifant, Harvard University
  • Robyn Gayle Dychiao, University of the Philippines
  • Jose Carlo Artiaga, University of the Philippines
  • André Hiroshi Bando, Universidade Federal de São Paulo
  • Carolina Pelegrini Barbosa Gracitelli, Universidade Federal de São Paulo
  • Vincenz Ferrer, University of the Philippines
  • Leo Anthony Celi, Massachusetts Institute of Technology
  • Danielle Bitterman, Harvard University
  • Michael G Morley, Harvard University
  • Luis Filipe Nakayama, Massachusetts Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v39i27.35053

Abstract

Current ophthalmology clinical workflows are plagued by over-referrals, long waits, and complex and heterogeneous medical records. Large language models (LLMs) present a promising solution for automating procedures such as triaging, preliminary tests like visual acuity assessment, and report summarization. However, LLMs have demonstrated significantly varied performance across languages in natural language question-answering tasks, potentially exacerbating healthcare disparities in low- and middle-income countries (LMICs). This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages, allowing for direct cross-lingual comparisons. Our evaluation of 6 popular LLMs across 7 different languages reveals substantial bias across languages, highlighting risks for the clinical deployment of LLMs in LMICs. Existing debiasing methods such as Translation Chain-of-Thought or Retrieval-Augmented Generation (RAG) by themselves fall short of closing this performance gap, often failing to improve performance across all languages and lacking specificity for the medical domain. To address this issue, we propose CLARA (Cross-Lingual Reflective Agentic system), a novel inference-time debiasing method leveraging retrieval-augmented generation and self-verification. Our approach not only improves performance across all languages but also significantly reduces the multilingual bias gap, facilitating equitable LLM application across the globe.
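
To make the abstract's description concrete, the sketch below shows one plausible shape for an inference-time pipeline combining the two ingredients the abstract names, retrieval-augmented generation and self-verification, with an English pivot for a non-English question. It is a minimal illustration only: the abstract does not specify CLARA's actual architecture, and every name here (llm, retrieve, MAX_ROUNDS, answer_multilingual_mcq) is a hypothetical placeholder rather than the authors' API.

```python
# Hypothetical sketch of an inference-time cross-lingual debiasing loop in the
# spirit of CLARA (RAG + self-verification). The pipeline structure and all
# identifiers are assumptions for illustration, not the paper's implementation.

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (hosted or local model)."""
    raise NotImplementedError

def retrieve(query: str, k: int = 3) -> list[str]:
    """Placeholder for retrieval over an ophthalmology knowledge base."""
    raise NotImplementedError

MAX_ROUNDS = 2  # assumed cap on self-verification retries

def answer_multilingual_mcq(question: str, source_lang: str) -> str:
    # 1. Pivot to English so retrieval and reasoning happen in the model's
    #    typically strongest language, narrowing the cross-lingual gap.
    english_q = llm(
        f"Translate this {source_lang} exam question to English, "
        f"preserving all clinical detail:\n{question}"
    )

    # 2. Retrieval-augmented generation: ground the answer in domain passages.
    context = "\n".join(retrieve(english_q))
    answer = llm(
        f"Context:\n{context}\n\nQuestion:\n{english_q}\n"
        "Answer with the single best option and a brief rationale."
    )

    # 3. Self-verification: check the answer against the retrieved evidence
    #    and revise with the critique if the check fails.
    for _ in range(MAX_ROUNDS):
        verdict = llm(
            f"Question:\n{english_q}\nProposed answer:\n{answer}\n"
            f"Evidence:\n{context}\n"
            "Reply VALID if the answer is supported; otherwise explain the error."
        )
        if verdict.strip().upper().startswith("VALID"):
            break
        answer = llm(
            f"Revise the answer using this critique:\n{verdict}\n"
            f"Context:\n{context}\nQuestion:\n{english_q}"
        )
    return answer
```

A design note on this sketch: routing retrieval and verification through an English pivot is one common way to attack the multilingual gap, but it trades off fidelity to the source language; the paper's reported gains across all seven languages suggest its actual method handles this more carefully than this toy loop.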

Published

2025-04-11

How to Cite

Restrepo, D., Wu, C., Tang, Z., Shuai, Z., Phan, T. N. M., Ding, J.-E., … Nakayama, L. F. (2025). Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs. Proceedings of the AAAI Conference on Artificial Intelligence, 39(27), 28321–28330. https://doi.org/10.1609/aaai.v39i27.35053