Application of machine learning and natural language processing for predicting stroke-associated pneumonia

Hui-Chu Tsai; Cheng-Yang Hsieh; Sheng-Feng Sung

doi:10.3389/fpubh.2022.1009164

Frontiers in Public Health (Sep 2022)

Affiliations

Hui-Chu Tsai: Department of Radiology, Ditmanson Medical Foundation Chia-Yi Christian Hospital, Chiayi, Taiwan
Cheng-Yang Hsieh: Department of Neurology, Tainan Sin Lau Hospital, Tainan, Taiwan
Cheng-Yang Hsieh: School of Pharmacy, Institute of Clinical Pharmacy and Pharmaceutical Sciences, College of Medicine, National Cheng Kung University, Tainan, Taiwan
Sheng-Feng Sung: Division of Neurology, Department of Internal Medicine, Ditmanson Medical Foundation Chia-Yi Christian Hospital, Chiayi, Taiwan
Sheng-Feng Sung: Department of Nursing, Min-Hwei Junior College of Health Care Management, Tainan, Taiwan

DOI: https://doi.org/10.3389/fpubh.2022.1009164
Journal volume & issue: Vol. 10

Abstract

Background: Identifying patients at high risk of stroke-associated pneumonia (SAP) may permit targeting potential interventions to reduce its incidence. We aimed to explore the functionality of machine learning (ML) and natural language processing techniques on structured data and unstructured clinical text to predict SAP, comparing their performance with that of conventional risk scores.

Methods: Linked data between a hospital stroke registry and a deidentified research-based database including electronic health records and administrative claims data were used. Natural language processing was applied to extract textual features from clinical notes. The random forest algorithm was used to build ML models. The predictive performance of ML models was compared with the A2DS2, ISAN, PNA, and ACDD4 scores using the area under the receiver operating characteristic curve (AUC).

Results: Among 5,913 acute stroke patients hospitalized between Oct 2010 and Sep 2021, 450 (7.6%) developed SAP within the first 7 days after stroke onset. The ML model based on both textual features and structured variables had the highest AUC [0.840, 95% confidence interval (CI) 0.806–0.875], significantly higher than those of the ML model based on structured variables alone (0.828, 95% CI 0.793–0.863, P = 0.040), ACDD4 (0.807, 95% CI 0.766–0.849, P = 0.041), A2DS2 (0.803, 95% CI 0.762–0.845, P = 0.013), ISAN (0.795, 95% CI 0.752–0.837, P = 0.009), and PNA (0.778, 95% CI 0.735–0.822, P < 0.001). All models demonstrated adequate calibration except for the A2DS2 score.

Conclusions: The ML model based on both textual features and structured variables performed better than conventional risk scores in predicting SAP. The workflow used to generate ML prediction models can be disseminated for local adaptation by individual healthcare organizations.
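As an illustration of the comparison metric named in the abstract, the sketch below computes an AUC as the probability that a randomly chosen patient with SAP receives a higher predicted risk than a randomly chosen patient without it, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are assumptions made up for this example; they are not the study's data or code, and the authors presumably used standard statistical software.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative cases.

    labels: iterable of 0/1 outcomes (1 = event, e.g. SAP).
    scores: iterable of predicted risks or risk-score points.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 8 patients, 3 of whom developed the outcome.
toy_labels = [0, 0, 0, 1, 0, 1, 1, 0]
toy_scores = [1, 2, 4, 5, 3, 3, 6, 1]

print(roc_auc(toy_labels, toy_scores))  # 0.9
```

The pairwise form is quadratic in the number of patients, which is acceptable for a sketch; rank-based implementations give the same value more efficiently on cohorts of several thousand patients.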

Published in Frontiers in Public Health

ISSN: 2296-2565 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Medicine: Public aspects of medicine
Website: https://www.frontiersin.org/journals/public-health
