EHR foundation models improve robustness in the presence of temporal distribution shift

Lin Lawrence Guo; Ethan Steinberg; Scott Lanyon Fleming; Jose Posada; Joshua Lemmon; Stephen R. Pfohl; Nigam Shah; Jason Fries; Lillian Sung

doi:10.1038/s41598-023-30820-8

Scientific Reports (Mar 2023)

Affiliations

Lin Lawrence Guo: Program in Child Health Evaluative Sciences, The Hospital for Sick Children
Ethan Steinberg: Stanford Center for Biomedical Informatics Research, Stanford University
Scott Lanyon Fleming: Stanford Center for Biomedical Informatics Research, Stanford University
Jose Posada: Universidad del Norte
Joshua Lemmon: Program in Child Health Evaluative Sciences, The Hospital for Sick Children
Stephen R. Pfohl: Stanford Center for Biomedical Informatics Research, Stanford University
Nigam Shah: Stanford Center for Biomedical Informatics Research, Stanford University
Jason Fries: Stanford Center for Biomedical Informatics Research, Stanford University
Lillian Sung: Program in Child Health Evaluative Sciences, The Hospital for Sick Children

Journal volume & issue: Vol. 13, no. 1
pp. 1 – 11

Abstract

Temporal distribution shift negatively impacts the performance of clinical prediction models over time. Pretraining foundation models using self-supervised learning on electronic health records (EHR) may be effective in acquiring informative global patterns that can improve the robustness of task-specific models. The objective was to evaluate the utility of EHR foundation models in improving the in-distribution (ID) and out-of-distribution (OOD) performance of clinical prediction models. Transformer- and gated recurrent unit-based foundation models were pretrained on EHR of up to 1.8 M patients (382 M coded events) collected within pre-determined year groups (e.g., 2009–2012) and were subsequently used to construct patient representations for patients admitted to inpatient units. These representations were used to train logistic regression models to predict hospital mortality, long length of stay, 30-day readmission, and ICU admission. We compared our EHR foundation models with baseline logistic regression models learned on count-based representations (count-LR) in ID and OOD year groups. Performance was measured using area-under-the-receiver-operating-characteristic curve (AUROC), area-under-the-precision-recall curve, and absolute calibration error. Both transformer and recurrent-based foundation models generally showed better ID and OOD discrimination relative to count-LR and often exhibited less decay in tasks where there is observable degradation of discrimination performance (average AUROC decay of 3% for transformer-based foundation model vs. 7% for count-LR after 5–9 years). In addition, the performance and robustness of transformer-based foundation models continued to improve as pretraining set size increased. These results suggest that pretraining EHR foundation models at scale is a useful approach for developing clinical prediction models that perform well in the presence of temporal distribution shift.
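
The AUROC decay reported in the abstract compares discrimination in an ID year group with discrimination in a later OOD year group. The sketch below illustrates one way such a comparison could be computed. It is not the authors' code: the function names `auroc` and `auroc_decay` are hypothetical, the labels and scores are made-up toy values, and decay is assumed here to be the absolute difference between ID and OOD AUROC, which the abstract does not specify.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auroc_decay(id_auroc: float, ood_auroc: float) -> float:
    """Absolute drop in AUROC from the ID to the OOD year group (assumed definition)."""
    return id_auroc - ood_auroc


# Toy, made-up predictions for one task in an ID and an OOD year group.
id_labels = [0, 0, 1, 1]
id_scores = [0.1, 0.4, 0.35, 0.8]
ood_labels = [0, 0, 1, 1]
ood_scores = [0.1, 0.6, 0.35, 0.5]

id_auc = auroc(id_labels, id_scores)
ood_auc = auroc(ood_labels, ood_scores)
print(f"ID AUROC={id_auc:.2f}, OOD AUROC={ood_auc:.2f}, "
      f"decay={auroc_decay(id_auc, ood_auc):.2f}")
```

The pairwise loop is quadratic in the number of patients and is meant only to make the definition explicit; on real cohorts a rank-based implementation from an established library would be the more practical choice.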

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
