OpenDeID Pipeline for Unstructured Electronic Health Record Text Notes Based on Rules and Transformers: Deidentification Algorithm Development and Validation Study

Jiaxing Liu; Shalini Gupta; Aipeng Chen; Chen-Kai Wang; Pratik Mishra; Hong-Jie Dai; Zoie Shui-Yee Wong; Jitendra Jonnagaddala

doi:10.2196/48145

Journal of Medical Internet Research (Dec 2023)

DOI: https://doi.org/10.2196/48145
Journal volume & issue: Vol. 25, p. e48145

Abstract

Background: Electronic health records (EHRs) in unstructured formats are valuable sources of information for research in both the clinical and biomedical domains. However, before such records can be used for research purposes, sensitive health information (SHI) must be removed in several cases to protect patient privacy. Rule-based and machine learning–based methods have been shown to be effective in deidentification. However, very few studies have investigated the combination of transformer-based language models and rules.

Objective: The objective of this study is to develop a hybrid deidentification pipeline for Australian EHR text notes using rules and transformers. The study also aims to investigate the impact of pretrained word embeddings and transformer-based language models.

Methods: In this study, we present a hybrid deidentification pipeline called OpenDeID, which is developed using an Australian multicenter EHR-based corpus called the OpenDeID corpus. The OpenDeID corpus consists of 2100 pathology reports with 38,414 SHI entities from 1833 patients. The OpenDeID pipeline incorporates a hybrid approach of associative rules, supervised deep learning, and pretrained language models.

Results: The OpenDeID pipeline achieved a best F1-score of 0.9659 by fine-tuning the Discharge Summary BioBERT model and incorporating various preprocessing and postprocessing rules. The OpenDeID pipeline has been deployed at a large tertiary teaching hospital and has processed over 8000 unstructured EHR text notes in real time.

Conclusions: The OpenDeID pipeline is a hybrid deidentification pipeline to deidentify SHI entities in unstructured EHR text notes. The pipeline has been evaluated on a large multicenter corpus. External validation will be undertaken as part of our future work to evaluate the effectiveness of the OpenDeID pipeline.
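For readers unfamiliar with the metric reported in the abstract, the following is a minimal sketch of how an entity-level F1-score can be computed for a deidentification task. It is an illustration only, not the evaluation code used in the study: the abstract does not state the matching criterion or averaging scheme, so exact-match scoring over (start, end, type) spans is an assumption, and the `Span` type, the function name, and the toy data are all hypothetical.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start offset, end offset, entity type).
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall, and F1.

    Assumption: a predicted entity counts as correct only if its offsets
    and type both match a gold entity exactly.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (invented for illustration, not taken from the OpenDeID corpus).
gold = {(0, 8, "NAME"), (20, 30, "DATE"), (45, 52, "ID"), (60, 70, "LOCATION")}
predicted = {(0, 8, "NAME"), (20, 30, "DATE"), (45, 52, "ID"), (80, 85, "DATE")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case three of the four predicted spans match gold spans and three of the four gold spans are recovered, so precision, recall, and F1 are all 0.75. In a deidentification setting, recall is often the more critical component, because a missed SHI entity remains in the released text.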

Published in Journal of Medical Internet Research

ISSN: 1438-8871 (Online)
Publisher: JMIR Publications
Country of publisher: Canada
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Medicine: Public aspects of medicine
Website: https://www.jmir.org
