Survey on RNN and CRF models for de-identification of medical free text

Joffrey L. Leevy; Taghi M. Khoshgoftaar; Flavio Villanustre

doi:10.1186/s40537-020-00351-4

Journal of Big Data (Sep 2020)

Affiliations

Joffrey L. Leevy: Florida Atlantic University
Taghi M. Khoshgoftaar: Florida Atlantic University
Flavio Villanustre: LexisNexis Business Information Solutions

DOI: https://doi.org/10.1186/s40537-020-00351-4
Journal volume & issue: Vol. 7, no. 1, pp. 1–22

Abstract

The increasing reliance on electronic health records (EHRs) in areas such as medical research should be matched by ample safeguards for patient privacy. These records often qualify as big data, and because a significant portion is stored as free (unstructured) text, we examined relevant work on automated free-text de-identification with recurrent neural network (RNN) and conditional random field (CRF) approaches. Both methods involve machine learning and are widely used for the removal of protected health information (PHI) from free text. Our survey produced several informative findings. First, RNN models, particularly long short-term memory (LSTM) algorithms, generally outperformed CRF models and other systems, namely rule-based algorithms. Second, hybrid or ensemble systems containing joint LSTM-CRF models showed no advantage over individual LSTM and CRF models. Third, overfitting may be an issue when customized de-identification datasets are used during model training. Finally, statistical validation of performance scores and diversity during experimentation were largely ignored. In this survey, we also identify major research gaps that should be considered for future work.

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com
