MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

Authors

  • Scott L. Fleming Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA; Department of Computer Science, Stanford School of Engineering, Stanford, CA, USA
  • Alejandro Lozano Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA
  • William J. Haberkorn Department of Anesthesiology, Peri-operative, and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA; Department of Pediatrics, Stanford School of Medicine, Stanford, CA, USA
  • Jenelle A. Jindal Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
  • Eduardo Reis Department of Radiology, Stanford School of Medicine, Stanford, CA, USA; Center for Artificial Intelligence in Medicine and Imaging (AIMI), Stanford University, Stanford, CA, USA; Hospital Israelita Albert Einstein, Sao Paulo, SP, Brazil
  • Rahul Thapa Technology and Digital Solutions, Stanford Health Care, Palo Alto, CA, USA
  • Louis Blankemeier Department of Electrical Engineering, Stanford School of Engineering, Stanford, CA, USA
  • Julian Z. Genkins Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
  • Ethan Steinberg Department of Computer Science, Stanford School of Engineering, Stanford, CA, USA; Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
  • Ashwin Nayak Department of Medicine, Stanford School of Medicine, Stanford, CA, USA
  • Birju Patel Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
  • Chia-Chun Chiang Department of Neurology, Mayo Clinic, Rochester, MN, USA; Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA
  • Alison Callahan Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA; Department of Medicine, Stanford School of Medicine, Stanford, CA, USA
  • Zepeng Huo Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
  • Sergios Gatidis Department of Radiology, Stanford School of Medicine, Stanford, CA, USA
  • Scott Adams Department of Radiology, Stanford School of Medicine, Stanford, CA, USA
  • Oluseyi Fayanju Department of Medicine, Stanford School of Medicine, Stanford, CA, USA
  • Shreya J. Shah Department of Medicine, Stanford School of Medicine, Stanford, CA, USA
  • Thomas Savage Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA; Division of Hospital Medicine, Stanford University, Stanford, CA, USA
  • Ethan Goh Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA; Clinical Excellence Research Center, Stanford School of Medicine, Stanford, CA, USA
  • Akshay S. Chaudhari Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA; Department of Radiology, Stanford School of Medicine, Stanford, CA, USA; Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA
  • Nima Aghaeepour Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA; Department of Anesthesiology, Peri-operative, and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA; Department of Pediatrics, Stanford School of Medicine, Stanford, CA, USA
  • Christopher Sharp Department of Medicine, Stanford School of Medicine, Stanford, CA, USA; Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA
  • Michael A. Pfeffer Technology and Digital Solutions, Stanford Health Care, Palo Alto, CA, USA; Department of Medicine, Stanford School of Medicine, Stanford, CA, USA
  • Percy Liang Department of Computer Science, Stanford School of Engineering, Stanford, CA, USA; Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA
  • Jonathan H. Chen Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA; Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA; Division of Hospital Medicine, Stanford University, Stanford, CA, USA; Clinical Excellence Research Center, Stanford School of Medicine, Stanford, CA, USA
  • Keith E. Morse Department of Pediatrics, Stanford School of Medicine, Stanford, CA, USA
  • Emma P. Brunskill Department of Computer Science, Stanford School of Engineering, Stanford, CA, USA; Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA
  • Jason A. Fries Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
  • Nigam H. Shah Technology and Digital Solutions, Stanford Health Care, Palo Alto, CA, USA; Department of Medicine, Stanford School of Medicine, Stanford, CA, USA; Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA; Clinical Excellence Research Center, Stanford School of Medicine, Stanford, CA, USA

DOI:

https://doi.org/10.1609/aaai.v38i20.30205

Keywords:

General

Abstract

The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialties), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to evaluate 6 general domain LLMs, having clinicians rank the accuracy and quality of each LLM response. We found high error rates, ranging from 35% (GPT-4) to 68% (MPT-7B-Instruct), and an 8.3% drop in accuracy when moving from 32k to 2k context lengths for GPT-4. Finally, we report correlations between clinician rankings and automated natural language generation metrics as a way to rank LLMs without human review. We make MedAlign available under a research data use agreement to enable LLM evaluations on tasks aligned with clinician needs and preferences.

Published

2024-03-24

How to Cite

Fleming, S. L., Lozano, A., Haberkorn, W. J., Jindal, J. A., Reis, E., Thapa, R., Blankemeier, L., Genkins, J. Z., Steinberg, E., Nayak, A., Patel, B., Chiang, C.-C., Callahan, A., Huo, Z., Gatidis, S., Adams, S., Fayanju, O., Shah, S. J., Savage, T., Goh, E., Chaudhari, A. S., Aghaeepour, N., Sharp, C., Pfeffer, M. A., Liang, P., Chen, J. H., Morse, K. E., Brunskill, E. P., Fries, J. A., & Shah, N. H. (2024). MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records. Proceedings of the AAAI Conference on Artificial Intelligence, 38(20), 22021-22030. https://doi.org/10.1609/aaai.v38i20.30205