Informatics in Medicine Unlocked (Jan 2024)
Comparison of nine machine learning regression models in predicting hospital length of stay for patients admitted to a general medicine department
Abstract
Background: The General Medicine (GM) department has the highest patient volume and the greatest heterogeneity of all hospital specialties. Because patients present with a wide range of conditions and traits, closely examining hospitalization data is crucial. Length of stay (LoS) in hospital is often used as an efficiency indicator. It is influenced by various factors, including the patient's medical background, demographics, and the type of diseases, signs, and symptoms recorded at triage. LoS varies widely, which makes it difficult to estimate promptly and accurately, although doing so is highly beneficial. Moreover, efficiently grouping and managing patients based on their expected LoS remains a significant challenge for healthcare organizations. Objectives: This study aimed to compare the predictive ability of nine Machine Learning (ML) regression models in estimating LoS in days, using demographic and clinical information recorded at admission as independent variables. Methods: We analyzed data on patients who were admitted through the Emergency Department and hospitalized in the GM department of the Sant'Orsola-Malpighi University Hospital in Bologna, Italy. The data were collected from January 1, 2022, to October 26, 2022. Nine ML regression models were used to predict LoS from historical data and patient information. Model performance was assessed using root mean squared prediction error (RMSPE) and mean absolute prediction error (MAPE). In addition, we used K-means clustering to group patients' medical and organizational criticalities (such as diseases, signs, symptoms, and administrative problems) into four clusters. Feature importance plots and SHAP (SHapley Additive exPlanations) values were used to identify the most important features and enhance the interpretability of the results. Results: We analyzed the LoS of 3757 eligible patients, which had a mean of 13 days and a standard deviation of 11.8 days.
We randomly divided patients into a training cohort of 2630 (70%) and a test cohort of 1127 (30%). Across the models, prediction error ranged from 11.00 to 16.16 days for RMSPE and from 7.52 to 10.78 days for MAPE. The eXtreme Gradient Boosting Regression (XGBR) model had the lowest prediction error in terms of both RMSPE (11.00 days) and MAPE (7.52 days). The top features included sex, arrival by own vehicle or walk-in, ambulance arrival, light blue risk category, age 70 or older, and orange risk category. Conclusion: The ML models evaluated in this study showed good predictive performance, with the XGBR model exhibiting the lowest prediction error. This model has the potential to help physicians administer appropriate clinical interventions for patients in the GM department. It can also help healthcare services predict the resources needed to better manage hospitalization.
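The two error measures and the 70/30 split reported above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the random seed, and the toy LoS values are assumptions made for illustration, and the abstract does not state which tooling was used. Note that MAPE here denotes mean absolute prediction error in days, not a percentage error.

```python
import math
import random


def rmspe(actual, predicted):
    """Root mean squared prediction error, in the units of the target (days)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute prediction error in days (not a percentage error)."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


def split_indices(n, train_fraction=0.7, seed=0):
    """Randomly split n patient indices into training and test sets.

    The seed and the rounding rule are assumptions, not details from the study.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_fraction)
    return indices[:n_train], indices[n_train:]


# Cohort size taken from the abstract; the split reproduces the reported counts.
train_idx, test_idx = split_indices(3757)
print(len(train_idx), len(test_idx))

# Toy LoS values in days (hypothetical, not study data).
actual_los = [3, 10, 21, 7]
predicted_los = [5, 8, 15, 7]
print(rmspe(actual_los, predicted_los))
print(mape(actual_los, predicted_los))
```

Because RMSPE squares each error before averaging, a few long-stay patients with large misses raise it more than they raise MAPE, which is consistent with the RMSPE values reported above being larger than the MAPE values for every model.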