Future Internet (Aug 2024)

Masketeer: An Ensemble-Based Pseudonymization Tool with Entity Recognition for German Unstructured Medical Free Text

  • Martin Baumgartner,
  • Karl Kreiner,
  • Fabian Wiesmüller,
  • Dieter Hayn,
  • Christian Puelacher,
  • Günter Schreier

DOI
https://doi.org/10.3390/fi16080281
Journal volume & issue
Vol. 16, no. 8
p. 281

Abstract

Read online

Background: The recent rise of large language models has triggered renewed interest in medical free text data, which holds critical information about patients and diseases. However, medical free text is also highly sensitive. Therefore, de-identification is typically required but is complicated since medical free text is mostly unstructured. With the Masketeer algorithm, we present an effective tool to de-identify German medical text. Methods: We used an ensemble of different masking classes to remove references to identifiable data from over 35,000 clinical notes in accordance with the HIPAA Safe Harbor Guidelines. To retain additional context for readers, we implemented an entity recognition scheme and corpus-wide pseudonymization. Results: The algorithm performed with a sensitivity of 0.943 and specificity of 0.933. Further performance analyses showed linear runtime complexity (O(n)) with both increasing text length and corpus size. Conclusions: In the future, large language models will likely be able to de-identify medical free text more effectively and thoroughly than handcrafted rules. However, such gold-standard de-identification tools based on large language models are yet to emerge. In the current absence of such, we hope to provide best practices for a robust rule-based algorithm designed with expert domain knowledge.

Keywords