IEEE Access (Jan 2019)
Enhancing Predictive Power of Cluster-Boosted Regression With Text-Based Indexing
Abstract
Clustering prior to regression analysis improves prediction accuracy in clinical decision making. However, most previously described methods focused on numerical data only. This paper investigated how well textual features can improve the accuracy of regression predictions. Textual features were derived from the preliminary diagnosis, diagnosis summary, and prescribed drug names provided in the MIMIC II dataset. We proposed the bag-of-entities indexing method, which relies on named entity recognition, a machine learning technique for locating words and phrases in text and classifying them into predefined classes. The proposed technique captured meaningful phrases from texts in health records and represented them as numerical vectors. The dimensionality of the data space was reduced using principal component analysis. The additional, well-tuned textual features were then combined with the existing numerical features in cluster-boosted regression to predict patient mortality in the ICU. The experimental results showed that adding textual features improved prediction over using numerical features alone. The proposed indexing method outperformed traditional word-vector representations (bag-of-words and bag-of-bigrams) as well as a state-of-the-art approach (Doc2vec) in accuracy of predicting death status. Moreover, rather than being interpreted directly, the identifiable individual features were grouped into types and summarized. The summarized, de-identified textual features handled by the proposed framework can support predictive classification while also reducing privacy concerns. Grouping similar patients based on their electronic health records also benefits physicians through improved differential diagnosis and effective treatment planning.
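The indexing step described in the abstract, turning recognized entity phrases into numerical vectors, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a named entity recognition step has already produced one list of entity phrases per patient record, and the function names, sample phrases, and plain count weighting are all hypothetical choices made for illustration. The later steps (principal component analysis and cluster-boosted regression) are not shown.

```python
from collections import Counter


def build_vocabulary(records):
    """Map each distinct entity phrase to a column index (sorted for determinism)."""
    phrases = sorted({phrase for record in records for phrase in record})
    return {phrase: idx for idx, phrase in enumerate(phrases)}


def bag_of_entities(record, vocabulary):
    """Return a count vector for one record; phrases not in the vocabulary are ignored."""
    counts = Counter(record)
    vector = [0] * len(vocabulary)
    for phrase, n in counts.items():
        if phrase in vocabulary:
            vector[vocabulary[phrase]] = n
    return vector


# Hypothetical NER output: one list of entity phrases per patient record.
# Multi-word phrases stay intact, unlike in a bag-of-words representation.
records = [
    ["congestive heart failure", "furosemide"],
    ["pneumonia", "furosemide", "furosemide"],
]

vocabulary = build_vocabulary(records)
vectors = [bag_of_entities(record, vocabulary) for record in records]
```

Each record becomes one row with a column per distinct entity phrase; under the assumptions above, such rows are what a dimensionality reduction step would take as input before being combined with numerical features.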
Keywords