Big Data and Cognitive Computing (Aug 2021)
A Simple Free-Text-like Method for Extracting Semi-Structured Data from Electronic Health Records: Exemplified in Prediction of In-Hospital Mortality
Abstract
The Epic electronic health record (EHR) is a commonly used EHR in the United States. This EHR contains large semi-structured “flowsheet” fields. Flowsheet fields lack a well-defined data dictionary and are unique to each site. We evaluated a simple free-text-like method to extract these data. As a use case, we demonstrate this method in predicting mortality during emergency department (ED) triage. We retrieved demographic and clinical data for ED visits from the Epic EHR (1/2014–12/2018). Data included structured fields, semi-structured flowsheet records, and free-text notes. The study outcome was in-hospital death within 48 h. Most of the data were coded using a free-text-like Bag-of-Words (BoW) approach. Two machine-learning models were trained: gradient boosting and logistic regression. Term frequency-inverse document frequency (tf-idf) weighting was employed in the logistic regression model (LR-tf-idf). An ensemble of LR-tf-idf and gradient boosting was also evaluated. Models were trained on years 2014–2017 and tested on year 2018. Among 412,859 visits, the 48-h mortality rate was 0.2%. LR-tf-idf showed an AUC of 0.98 (95% CI: 0.98–0.99). Gradient boosting showed an AUC of 0.97 (95% CI: 0.96–0.99). The ensemble of both showed an AUC of 0.99 (95% CI: 0.98–0.99). In conclusion, a free-text-like approach can be useful for extracting knowledge from large amounts of complex semi-structured EHR data.
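As an illustration only, the free-text-like coding described above might be sketched as follows: each flowsheet row is flattened into a single token, tokens are counted per visit (BoW), and the counts are reweighted with tf-idf. The field names, the `field=value` token format, and the helper names `flowsheet_to_tokens`, `bag_of_words`, and `tf_idf` are assumptions made for this sketch, not details taken from the study. The tf-idf formula is the unsmoothed textbook variant, which may differ from the one the authors used.

```python
import math
from collections import Counter


def flowsheet_to_tokens(rows):
    """Flatten (field, value) flowsheet rows into free-text-like tokens.

    The "field=value" format is a hypothetical choice for this sketch.
    """
    tokens = []
    for field, value in rows:
        # Lowercase and join internal whitespace so each row becomes one token.
        f = "_".join(str(field).lower().split())
        v = "_".join(str(value).lower().split())
        tokens.append(f"{f}={v}")
    return tokens


def bag_of_words(tokens):
    """Count token occurrences for one visit."""
    return Counter(tokens)


def tf_idf(docs):
    """Weight each visit's counts by tf * log(N / df) (unsmoothed variant)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())  # each token counted once per visit
    weighted = []
    for doc in docs:
        total = sum(doc.values())
        weighted.append(
            {t: (c / total) * math.log(n_docs / doc_freq[t]) for t, c in doc.items()}
        )
    return weighted


# Two made-up visits with hypothetical flowsheet fields.
visits = [
    [("Pain Score", "7"), ("Arrival Mode", "Ambulance"), ("Triage Level", "3")],
    [("Pain Score", "2"), ("Arrival Mode", "Walk in"), ("Triage Level", "3")],
]
docs = [bag_of_words(flowsheet_to_tokens(v)) for v in visits]
weights = tf_idf(docs)
```

In this sketch a token shared by every visit (here `triage_level=3`) receives a weight of zero, while visit-specific tokens receive positive weights. The resulting sparse weights are the kind of input a logistic regression model could consume. No site-specific data dictionary is needed because the tokens are derived directly from whatever field names and values appear.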
Keywords