MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

Scott L. Fleming; Alejandro Lozano; William J. Haberkorn; Jenelle A. Jindal; Eduardo Reis; Rahul Thapa; Louis Blankemeier; Julian Z. Genkins; Ethan Steinberg; Ashwin Nayak; Birju Patel; Chia-Chun Chiang; Alison Callahan; Zepeng Huo; Sergios Gatidis; Scott Adams; Oluseyi Fayanju; Shreya J. Shah; Thomas Savage; Ethan Goh; Akshay S. Chaudhari; Nima Aghaeepour; Christopher Sharp; Michael A. Pfeffer; Percy Liang; Jonathan H. Chen; Keith E. Morse; Emma P. Brunskill; Jason A. Fries; Nigam H. Shah

doi:10.1609/aaai.v38i20.30205

Authors

Scott L. Fleming: Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA; Department of Computer Science, Stanford School of Engineering, Stanford, CA, USA
Alejandro Lozano: Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA
William J. Haberkorn: Department of Anesthesiology, Peri-operative, and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA; Department of Pediatrics, Stanford School of Medicine, Stanford, CA, USA
Jenelle A. Jindal: Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
Eduardo Reis: Department of Radiology, Stanford School of Medicine, Stanford, CA, USA; Center for Artificial Intelligence in Medicine and Imaging (AIMI), Stanford University, Stanford, CA, USA; Hospital Israelita Albert Einstein, Sao Paulo, SP, Brazil
Rahul Thapa: Technology and Digital Solutions, Stanford Health Care, Palo Alto, CA, USA
Louis Blankemeier: Department of Electrical Engineering, Stanford School of Engineering, Stanford, CA, USA
Julian Z. Genkins: Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
Ethan Steinberg: Department of Computer Science, Stanford School of Engineering, Stanford, CA, USA; Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
Ashwin Nayak: Department of Medicine, Stanford School of Medicine, Stanford, CA, USA
Birju Patel: Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
Chia-Chun Chiang: Department of Neurology, Mayo Clinic, Rochester, MN, USA; Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA
Alison Callahan: Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA; Department of Medicine, Stanford School of Medicine, Stanford, CA, USA
Zepeng Huo: Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
Sergios Gatidis: Department of Radiology, Stanford School of Medicine, Stanford, CA, USA
Scott Adams: Department of Radiology, Stanford School of Medicine, Stanford, CA, USA
Oluseyi Fayanju: Department of Medicine, Stanford School of Medicine, Stanford, CA, USA
Shreya J. Shah: Department of Medicine, Stanford School of Medicine, Stanford, CA, USA
Thomas Savage: Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA; Division of Hospital Medicine, Stanford University, Stanford, CA, USA
Ethan Goh: Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA; Clinical Excellence Research Center, Stanford School of Medicine, Stanford, CA, USA
Akshay S. Chaudhari: Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA; Department of Radiology, Stanford School of Medicine, Stanford, CA, USA; Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA
Nima Aghaeepour: Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA; Department of Anesthesiology, Peri-operative, and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA; Department of Pediatrics, Stanford School of Medicine, Stanford, CA, USA
Christopher Sharp: Department of Medicine, Stanford School of Medicine, Stanford, CA, USA; Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA
Michael A. Pfeffer: Technology and Digital Solutions, Stanford Health Care, Palo Alto, CA, USA; Department of Medicine, Stanford School of Medicine, Stanford, CA, USA
Percy Liang: Department of Computer Science, Stanford School of Engineering, Stanford, CA, USA; Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA
Jonathan H. Chen: Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA; Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA; Division of Hospital Medicine, Stanford University, Stanford, CA, USA; Clinical Excellence Research Center, Stanford School of Medicine, Stanford, CA, USA
Keith E. Morse: Department of Pediatrics, Stanford School of Medicine, Stanford, CA, USA
Emma P. Brunskill: Department of Computer Science, Stanford School of Engineering, Stanford, CA, USA; Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA
Jason A. Fries: Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
Nigam H. Shah: Technology and Digital Solutions, Stanford Health Care, Palo Alto, CA, USA; Department of Medicine, Stanford School of Medicine, Stanford, CA, USA; Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA; Clinical Excellence Research Center, Stanford School of Medicine, Stanford, CA, USA

Keywords:

General

Abstract

The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question-answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialties), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to evaluate 6 general-domain LLMs, having clinicians rank the accuracy and quality of each LLM response. We found high error rates, ranging from 35% (GPT-4) to 68% (MPT-7B-Instruct), and an 8.3% drop in accuracy when moving from 32k to 2k context lengths for GPT-4. Finally, we report correlations between clinician rankings and automated natural language generation metrics as a way to rank LLMs without human review. We make MedAlign available under a research data use agreement to enable LLM evaluations on tasks aligned with clinician needs and preferences.
