Preventive Medicine Reports (Sep 2024)

An early sepsis prediction model utilizing machine learning and unbalanced data processing in a clinical context

  • Luyao Zhou,
  • Min Shao,
  • Cui Wang,
  • Yu Wang

Journal volume & issue
Vol. 45
p. 102841

Abstract

Background: Early and accurate diagnosis of sepsis is essential to reduce mortality. However, despite a growing number of related studies, sepsis is still diagnosed by traditional methods in China, which may contribute to delays in treatment. Methods: The study included 2,385 patients, 364 of whom had sepsis, with data collected from the First Affiliated Hospital of Anhui Medical University and partner hospitals from April to July 2022. External validation was conducted using the MIMIC-III database (over 60,000 patients from 2001 to 2012) and the eICU Collaborative Research Database (139,000 patients from 2014 to 2015). Multiple machine learning models, together with SHapley Additive exPlanations (SHAP) analysis, were applied to identify the main risk factors for accurate sepsis prediction. Multiple imputation was used to fill in missing data, and the Synthetic Minority Oversampling Technique (SMOTE) was used to balance the classes. Results: Eighteen diagnostic features were used in the predictive model for early sepsis. The Random Forest model performed best among all models, with an Area Under the Curve (AUC) of 87% and an F1-score (F1) of 77%. Moreover, the interpretation from the SHAP analysis was generally consistent with current clinical knowledge. Conclusion: The study revealed the relationship between these 18 clinical features and diagnostic outcomes. The results indicate that patients whose Systolic Blood Pressure, Albumin, and Heart Rate values exceed certain thresholds are at a high likelihood of developing sepsis.
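
The class balancing step named in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: it is a minimal, standard-library-only version of the SMOTE idea (pick a minority sample, pick one of its k nearest minority neighbours, and interpolate between them). The function name `smote`, its parameters, and the toy heart rate and systolic blood pressure values are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples built by SMOTE-style interpolation.

    minority: list of equal-length numeric feature vectors, minority class only.
    This is an illustrative sketch, not the study's actual pipeline.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # The new point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy rows: [heart rate, systolic blood pressure].
sepsis = [[118.0, 88.0], [125.0, 92.0], [110.0, 95.0], [130.0, 85.0]]
new_rows = smote(sepsis, n_new=6, k=3)
```

With the cohort sizes reported above (364 sepsis cases among 2,385 patients, so 2,021 without sepsis), fully equalizing the two classes would take 2,021 - 364 = 1,657 synthetic cases; the abstract does not state the ratio actually used. Oversampling of this kind is normally applied to the training split only, so that evaluation data contain no synthetic rows.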

Keywords