Predictive models of long COVID

Blessy Antony; Hannah Blau; Elena Casiraghi; Johanna J. Loomba; Tiffany J. Callahan; Bryan J. Laraway; Kenneth J. Wilkins; Corneliu C. Antonescu; Giorgio Valentini; Andrew E. Williams; Peter N. Robinson; Justin T. Reese; T.M. Murali; Christopher Chute

EBioMedicine (Oct 2023)

Affiliations

Blessy Antony: Department of Computer Science, Virginia Polytechnic Institute and State University (Virginia Tech), Blacksburg, VA, 24061, USA
Hannah Blau: The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
Elena Casiraghi: AnacletoLab, Computer Science Department, Dipartimento di Informatica, Università degli Studi di Milano, Milan, 20133, Italy; Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; ELLIS - European Laboratory for Learning and Intelligent Systems, Milan Unit, Milan, 20133, Italy
Johanna J. Loomba: Integrated Translational Health Research Institute of Virginia, University of Virginia, Charlottesville, VA, 22904, USA
Tiffany J. Callahan: Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, 10032, USA
Bryan J. Laraway: Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
Kenneth J. Wilkins: Biostatistics Program, Office of the Director, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, 20814, USA
Corneliu C. Antonescu: Banner Health, University of Arizona, Phoenix, AZ, 85006, USA
Giorgio Valentini: AnacletoLab, Computer Science Department, Dipartimento di Informatica, Università degli Studi di Milano, Milan, 20133, Italy; ELLIS - European Laboratory for Learning and Intelligent Systems, Milan Unit, Milan, 20133, Italy
Andrew E. Williams: Institute for Clinical Research and Health Policy Studies, Tufts University School of Medicine, Boston, MA, 02111, USA
Peter N. Robinson: The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA; Institute for Systems Genomics, University of Connecticut, Farmington, CT, 06269, USA
Justin T. Reese: Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
T.M. Murali: Department of Computer Science, Virginia Polytechnic Institute and State University (Virginia Tech), Blacksburg, VA, 24061, USA; Corresponding author.
Christopher Chute

Journal volume & issue: Vol. 96, p. 104777

Abstract

Summary

Background: The cause and symptoms of long COVID are poorly understood. It is challenging to predict whether a given COVID-19 patient will develop long COVID in the future.

Methods: We used electronic health record (EHR) data from the National COVID Cohort Collaborative (N3C) to predict the incidence of long COVID. We trained two machine learning (ML) models: logistic regression (LR) and random forest (RF). Features used to train the predictors included symptoms and drugs ordered during acute infection, measures of COVID-19 treatment, pre-COVID comorbidities, and demographic information. We assigned the ‘long COVID’ label to patients diagnosed with the U09.9 ICD10-CM code. The cohorts included patients with (a) EHRs reported from data partners using the U09.9 ICD10-CM code and (b) at least one EHR in each feature category. We analysed three cohorts: all patients (n = 2,190,579; diagnosed with long COVID = 17,036), inpatients (149,319; 3,295), and outpatients (2,041,260; 13,741).

Findings: The LR and RF models yielded median AUROCs of 0.76 and 0.75, respectively. An ablation study revealed that drugs had the highest influence on the prediction task. The SHAP method identified age, gender, cough, fatigue, albuterol, obesity, diabetes, and chronic lung disease as explanatory features. Models trained on data from one N3C partner and tested on data from the other partners had an average AUROC of 0.75.

Interpretation: ML-based classification using EHR information from the acute infection period is effective in predicting long COVID. SHAP methods identified important features for prediction. Cross-site analysis demonstrated the generalizability of the proposed methodology.

Funding: NCATS U24 TR002306, NCATS UL1 TR003015, Axle Informatics Subcontract: NCATS-P00438-B, NIH/NIDDK/OD, PSR2015-1720GVALE_01, G43C22001320007, and Director, Office of Science, Office of Basic Energy Sciences of the U.S. Department of Energy Contract No. DE-AC02-05CH11231.
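As a reading aid for the numbers in the abstract, the following minimal Python sketch does two things. It derives the share of labelled patients in each cohort from the counts reported above, and it shows one common way to compute AUROC: the fraction of (positive, negative) pairs in which the positive patient receives the higher score, with ties counted as one half. The names `COHORTS`, `prevalence`, and `auroc`, along with the toy labels and scores, are illustrative assumptions and are not taken from the study, which does not describe its implementation here.

```python
# Cohort sizes and long COVID (U09.9) counts as reported in the abstract:
# name -> (number of patients, number diagnosed with long COVID).
COHORTS = {
    "all": (2_190_579, 17_036),
    "inpatients": (149_319, 3_295),
    "outpatients": (2_041_260, 13_741),
}


def prevalence(total, positives):
    """Fraction of a cohort that carries the long COVID label."""
    return positives / total


def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen positive example
    is scored above a randomly chosen negative one (ties count 0.5).

    This is a plain O(P * N) illustration, not an efficient implementation.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    for name, (total, positives) in COHORTS.items():
        share = 100 * prevalence(total, positives)
        print(f"{name}: {share:.2f}% labelled with long COVID")

    # Hypothetical toy labels and scores, not data from the study.
    toy_labels = [0, 0, 1, 0, 1, 1]
    toy_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
    print(f"toy AUROC: {auroc(toy_labels, toy_scores):.3f}")
```

The inpatient and outpatient counts sum to the totals given for all patients, and fewer than 1% of all patients carry the label. That degree of class imbalance is one reason a ranking metric such as AUROC tends to be more informative than raw accuracy for a task like this.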

Published in EBioMedicine

ISSN: 2352-3964 (Online)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Medicine: Medicine (General)
Website: http://www.journals.elsevier.com/ebiomedicine/
