Building a best-in-class automated de-identification tool for electronic health records through ensemble learning

Karthik Murugadoss; Ajit Rajasekharan; Bradley Malin; Vineet Agarwal; Sairam Bade; Jeff R. Anderson; Jason L. Ross; William A. Faubion, Jr.; John D. Halamka; Venky Soundararajan; Sankar Ardhanari

doi:10.1016/j.patter.2021.100255

Patterns (Jun 2021)

Affiliations

Karthik Murugadoss: nference, Cambridge, MA 02142, USA
Ajit Rajasekharan: nference, Cambridge, MA 02142, USA
Bradley Malin: Vanderbilt University Medical Center, Nashville, TN 37232, USA
Vineet Agarwal: nference, Cambridge, MA 02142, USA
Sairam Bade: nference Labs, Bangalore, India
Jeff R. Anderson: Mayo Clinic, Rochester, MN 55905, USA; Mayo Clinic Platform, Rochester, MN 55905, USA
Jason L. Ross: nference, Cambridge, MA 02142, USA
William A. Faubion, Jr.: Mayo Clinic, Rochester, MN 55905, USA
John D. Halamka: Mayo Clinic, Rochester, MN 55905, USA; Mayo Clinic Platform, Rochester, MN 55905, USA
Venky Soundararajan: nference, Cambridge, MA 02142, USA; Corresponding author
Sankar Ardhanari: nference, Cambridge, MA 02142, USA; Corresponding author

DOI: https://doi.org/10.1016/j.patter.2021.100255
Journal volume & issue: Vol. 2, no. 6, p. 100255

Abstract

Summary: The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Detected identifiers are then transformed into plausible, though fictional, surrogates to further obfuscate any leaked identifier. Our approach outperforms existing tools, with a recall of 0.992 and precision of 0.979 on the i2b2 2014 dataset and a recall of 0.994 and precision of 0.967 on a dataset of 10,000 notes from the Mayo Clinic. The de-identification system presented here enables the generation of de-identified patient data at the scale required for modern machine-learning applications to help accelerate medical discoveries.

The bigger picture: Clinical notes in electronic health records convey rich historical information regarding disease and treatment progression. However, this unstructured text often contains personally identifiable information such as names, phone numbers, or residential addresses of patients, thereby limiting its dissemination for research purposes. The removal of patient identifiers, through the process of de-identification, enables sharing of clinical data while preserving patient privacy. Here, we present a best-in-class approach to de-identification, which automatically detects identifiers and substitutes them with fabricated ones. Our approach enables de-identification of patient data at the scale required to harness the unstructured, context-rich information in electronic health records to aid in medical research and advancement.
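The detect-then-substitute idea described in the abstract can be illustrated with a toy sketch. This is not the authors' system: the regular expressions stand in for both the rule-based and the learned detectors, the fixed surrogate table is a simplification of realistic surrogate generation, and every name here (`Span`, `ensemble_detect`, `replace_with_surrogates`, `deidentify`, the detector functions) is hypothetical. It only shows how spans from several detectors might be unioned and then replaced.

```python
import re
from typing import Callable, List, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    label: str


Detector = Callable[[str], List[Span]]

# Toy patterns (assumptions for illustration only).
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
NAME_RE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+([A-Z][a-z]+)")


def phone_detector(text: str) -> List[Span]:
    return [Span(m.start(), m.end(), "PHONE") for m in PHONE_RE.finditer(text)]


def date_detector(text: str) -> List[Span]:
    return [Span(m.start(), m.end(), "DATE") for m in DATE_RE.finditer(text)]


def name_detector(text: str) -> List[Span]:
    # Stand-in for a learned model: flags the word following a title.
    return [Span(m.start(1), m.end(1), "NAME") for m in NAME_RE.finditer(text)]


def ensemble_detect(text: str, detectors: List[Detector]) -> List[Span]:
    """Union the spans from all detectors, merging any that overlap."""
    spans = sorted(s for detect in detectors for s in detect(text))
    merged: List[Span] = []
    for s in spans:
        if merged and s.start < merged[-1].end:
            prev = merged[-1]
            # Keep the earlier span's label and extend it to cover both.
            merged[-1] = Span(prev.start, max(prev.end, s.end), prev.label)
        else:
            merged.append(s)
    return merged


# Fixed placeholder surrogates; a real system would generate varied ones.
SURROGATES = {"PHONE": "555-010-0199", "DATE": "01/01/2000", "NAME": "Smith"}


def replace_with_surrogates(text: str, spans: List[Span]) -> str:
    """Rebuild the text, swapping each detected span for its surrogate."""
    pieces = []
    cursor = 0
    for s in spans:
        pieces.append(text[cursor:s.start])
        pieces.append(SURROGATES[s.label])
        cursor = s.end
    pieces.append(text[cursor:])
    return "".join(pieces)


def deidentify(text: str) -> str:
    detectors = [phone_detector, date_detector, name_detector]
    return replace_with_surrogates(text, ensemble_detect(text, detectors))
```

Taking the union of detector outputs favors recall over precision, which matches the general motivation for ensembles in de-identification; how the published system actually combines its models is not specified in this abstract.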

Published in Patterns

ISSN: 2666-3899 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: https://www.cell.com/patterns
