Journal of Big Data (Sep 2020)

Survey on RNN and CRF models for de-identification of medical free text

  • Joffrey L. Leevy,
  • Taghi M. Khoshgoftaar,
  • Flavio Villanustre

DOI
https://doi.org/10.1186/s40537-020-00351-4
Journal volume & issue
Vol. 7, no. 1
pp. 1 – 22

Abstract

The increasing reliance on electronic health records (EHRs) in areas such as medical research should be matched by ample safeguards for patient privacy. These records are often big data, and because a significant portion is stored as free (unstructured) text, we examined relevant work on automated free-text de-identification with recurrent neural network (RNN) and conditional random field (CRF) approaches. Both methods involve machine learning and are widely used to remove protected health information (PHI) from free text. Our survey produced several informative findings. Firstly, RNN models, particularly long short-term memory (LSTM) algorithms, generally outperformed CRF models as well as other systems, namely rule-based algorithms. Secondly, hybrid or ensemble systems containing joint LSTM-CRF models showed no advantage over individual LSTM and CRF models. Thirdly, overfitting may be an issue when customized de-identification datasets are used during model training. Finally, statistical validation of performance scores and diversity during experimentation were largely ignored. In our comprehensive survey, we also identify major research gaps that should be considered in future work.

Keywords