Global Epidemiology (Dec 2024)

Comparing AI/ML approaches and classical regression for predictive modeling using large population health databases: Applications to COVID-19 case prediction

  • Lise M. Bjerre,
  • Cayden Peixoto,
  • Rawan Alkurd,
  • Robert Talarico,
  • Rami Abielmona

Journal volume & issue
Vol. 8
p. 100168

Abstract

Read online

Background: Research comparing artificial intelligence and machine learning (AI/ML) methods with classical statistical methods applied to large population health databases is limited. Objectives: This retrospective cohort study aimed to compare the predictive performance of AI/ML algorithms against conventional multivariate logistic regression models using linked health administrative data. Methods: Using Ontario's population health databases, we created a cohort of residents of the city of Ottawa, Ontario, who underwent a PCR test for COVID-19 between March 10, 2020, and May 13, 2021. Using demographic, socio-economic and health data (including COVID-19 PCR test results and available, symptom data), we developed predictive models for the purpose of COVID-19 case identification using the following approaches: classical multivariate logistic regression (LR); deep neural network (DNN); random forest (RF); and gradient boosting trees (GBT). Model performance comparisons were made using the area under the curve (AUC) swarm plot for 10-fold cross-validation. Results: The cohort consisted of n = 351,248 Ottawa residents tested for COVID-19 during the study period. Among whom, a total of n = 883,879 unique COVID-19 tests were performed (2.6 % positive test results). Inclusion of COVID-19 symptoms data in the analysis improved model performance and variable predictive value across all tested models (p < 0.0001), with the 10-fold cross-validation AUC increasing to near or over 0.7 in all models when symptoms data were included. In various pairwise comparisons, the GBT method had the highest predictive ability (AUC = 0.796 ± 0.017), significantly outperforming multivariate logistic regression and the other AI/ML approaches. Conclusions: Conventional multivariate regression-based models are better than some and worse than other machine learning algorithms to provide good predictive accuracy in a moderate dataset with a reasonable number of features. However, whenever possible, the AI/ML GBT approach should be considered.

Keywords