Enhancing Predictive Power of Cluster-Boosted Regression With Text-Based Indexing

Wutthipong Kongburan; Mark Chignell; Nipon Charoenkitkarn; Jonathan H. Chan

doi:10.1109/ACCESS.2019.2908032

IEEE Access (Jan 2019)

Affiliations

Wutthipong Kongburan: Data Science and Engineering Laboratory, School of Information Technology, King Mongkut’s University of Technology Thonburi, Bangkok, Thailand
Mark Chignell: Mechanical and Industrial Engineering, University of Toronto, Toronto, ON, Canada
Nipon Charoenkitkarn: Data Science and Engineering Laboratory, School of Information Technology, King Mongkut’s University of Technology Thonburi, Bangkok, Thailand
Jonathan H. Chan: Data Science and Engineering Laboratory, School of Information Technology, King Mongkut’s University of Technology Thonburi, Bangkok, Thailand

DOI: https://doi.org/10.1109/ACCESS.2019.2908032
Journal volume & issue: Vol. 7, pp. 43394–43405

Abstract

Clustering prior to regression analysis improves the accuracy of prediction in clinical decision making. However, most previously described methods focused on numerical data only. This paper investigated how well textual features can improve the accuracy of regression predictions. Textual features were derived from the preliminary diagnosis, the diagnosis summary, and the drug names used in prescriptions, as provided in the MIMIC II dataset. We proposed the bag-of-entities indexing method, which relies on named entity recognition, a machine learning technique for locating words in text and classifying them into predefined classes. The proposed technique captured meaningful phrases from texts in health records and represented them in numerical vector format. The dimensionality of the data space was reduced using principal component analysis. The additional well-tuned textual features were then combined with the existing numerical features in cluster-boosted regression to predict patient mortality in the ICU. The experimental results showed that adding textual features improved prediction over the use of numerical features only. We found that the proposed indexing method outperformed traditional word-vector representation approaches (bag-of-words and bag-of-bigrams) as well as a state-of-the-art approach (Doc2vec) in terms of accuracy in predicting death status. Moreover, rather than being interpreted directly, the identifiable individual features were grouped into types and summarized. The summarized, de-identified textual features handled by the proposed framework can support predictive classification while also reducing privacy concerns. Grouping similar patients based on their electronic health records also benefits physicians through improved differential diagnosis and more effective treatment planning.
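The bag-of-entities idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon, the substring matcher standing in for a trained named entity recognizer, the function names, and the example notes are all assumptions made for illustration. The sketch only shows how recognized entity phrases, rather than individual words, might become the columns of a count vector, and how counts might be summarized by entity type.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a trained NER model.
# Keys are lowercase phrases, values are entity types; none of this is from the paper.
TOY_LEXICON = {
    "congestive heart failure": "DISEASE",
    "pneumonia": "DISEASE",
    "sepsis": "DISEASE",
    "furosemide": "DRUG",
    "vancomycin": "DRUG",
}


def find_entities(text, lexicon=TOY_LEXICON):
    """Return (phrase, type) pairs found in text by simple substring matching."""
    remaining = text.lower()
    found = []
    # Try longer phrases first so a multi-word entity is matched as one unit.
    for phrase in sorted(lexicon, key=len, reverse=True):
        count = remaining.count(phrase)
        if count:
            found.extend([(phrase, lexicon[phrase])] * count)
            # Blank out matched spans so shorter phrases cannot re-match inside them.
            remaining = remaining.replace(phrase, " ")
    return found


def bag_of_entities(docs, lexicon=TOY_LEXICON):
    """Build a count matrix: one row per document, one column per entity phrase."""
    vocab = sorted(lexicon)
    rows = []
    for doc in docs:
        counts = Counter(phrase for phrase, _ in find_entities(doc, lexicon))
        rows.append([counts[phrase] for phrase in vocab])
    return vocab, rows


def bag_of_entity_types(docs, lexicon=TOY_LEXICON):
    """Summarize each document by counts per entity type instead of per phrase."""
    types = sorted(set(lexicon.values()))
    rows = []
    for doc in docs:
        counts = Counter(etype for _, etype in find_entities(doc, lexicon))
        rows.append([counts[etype] for etype in types])
    return types, rows


# Made-up example notes, not taken from MIMIC II.
notes = [
    "Admitted with pneumonia and sepsis; started vancomycin.",
    "Congestive heart failure, treated with furosemide. Pneumonia ruled out.",
]
vocab, matrix = bag_of_entities(notes)
types, type_matrix = bag_of_entity_types(notes)
print(vocab)
print(matrix)
print(types)
print(type_matrix)
```

In this sketch the multi-word phrase "congestive heart failure" occupies a single column, which a bag-of-words index would split into three unrelated tokens. The matcher ignores negation ("ruled out") and word boundaries, which a real recognizer would need to handle. The resulting vectors could then be passed to a dimensionality-reduction step such as principal component analysis before clustering and regression, as the abstract outlines.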

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
