Classification of Severe Maternal Morbidity from Electronic Health Records Written in Spanish Using Natural Language Processing

Ever A. Torres-Silva; Santiago Rúa; Andrés F. Giraldo-Forero; Maria C. Durango; José F. Flórez-Arango; Andrés Orozco-Duque

doi:10.3390/app131910725

Applied Sciences (Sep 2023)

Affiliations

Ever A. Torres-Silva: Faculty of Engineering, Instituto Tecnológico Metropolitano, Medellín 050034, Colombia
Santiago Rúa: School of Basic Sciences, Technologies and Engineering, Universidad Nacional Abierta y a Distancia, Bogota 111321, Colombia
Andrés F. Giraldo-Forero: Faculty of Engineering, Instituto Tecnológico Metropolitano, Medellín 050034, Colombia
Maria C. Durango: Department of Applied Sciences, Instituto Tecnológico Metropolitano, Medellín 050034, Colombia
José F. Flórez-Arango: Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
Andrés Orozco-Duque: Department of Applied Sciences, Instituto Tecnológico Metropolitano, Medellín 050034, Colombia

DOI: https://doi.org/10.3390/app131910725
Journal volume & issue: Vol. 13, no. 19
p. 10725

Abstract

One stepping stone toward reducing maternal mortality is identifying severe maternal morbidity (SMM) from Electronic Health Records (EHRs). We aim to develop a pipeline that represents and classifies the unstructured text of maternal progress notes into eight classes, according to silver labels defined by the ICD-10 codes associated with SMM. We preprocessed the text by removing protected health information (PHI) and reducing stop words. We built different pipelines to classify SMM by combining six word-embedding schemes, three approaches for document representation (average, clustering, and principal component analysis (PCA)), and five well-known machine learning classifiers. Additionally, we implemented an algorithm for typo and misspelling adjustment based on the Levenshtein distance to the Spanish Billion Word Corpus dictionary. We analyzed 43,529 documents, each constructed from an average of 4.15 progress notes, from 22,937 patients. The best-performing pipeline combined Word2Vec, typo and spelling adjustment, document representation by PCA, and an SVM classifier. We found that it is possible to identify conditions such as miscarriage complications or hypertensive disorders from clinical notes written in Spanish, with a true positive rate higher than 0.85. This is the first approach to classify SMM from the unstructured text contained in maternal EHRs, and it can contribute to the solution of one of the most important public health problems in the world. Future work should test other representation and classification approaches to detect the risk of SMM.
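The typo adjustment step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the details of the authors' algorithm, so everything below is an assumption made for illustration: the function names `levenshtein` and `correct_token`, the `max_distance` threshold of 2, and the three-word vocabulary that stands in for the Spanish Billion Word Corpus dictionary.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def correct_token(token: str, vocabulary: Iterable[str], max_distance: int = 2) -> str:
    """Return the closest vocabulary word within max_distance, else the token itself.

    max_distance is a hypothetical threshold, not a value taken from the paper.
    """
    words = list(vocabulary)
    if token in words:
        return token
    best_word, best_distance = token, max_distance + 1
    for word in words:
        distance = levenshtein(token, word)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word


# Toy vocabulary standing in for the corpus dictionary (assumption).
toy_vocabulary = ["hipertensión", "aborto", "embarazo"]
corrected = [correct_token(t, toy_vocabulary) for t in ["embarzo", "aborto", "xyz"]]
```

In this sketch a token is replaced only when some dictionary word lies within the threshold; tokens with no close match are left unchanged so that rare clinical terms are not overwritten. A linear scan over the vocabulary is fine for a toy list but would need an index or candidate pre-filtering for a dictionary of realistic size.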

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
