End-to-end pseudonymization of fine-tuned clinical BERT models

Thomas Vakili; Aron Henriksson; Hercules Dalianis

doi:10.1186/s12911-024-02546-8

BMC Medical Informatics and Decision Making (Jun 2024)

Affiliations

Thomas Vakili: Department of Computer and Systems Sciences, Stockholm University
Aron Henriksson: Department of Computer and Systems Sciences, Stockholm University
Hercules Dalianis: Department of Computer and Systems Sciences, Stockholm University

Journal volume & issue: Vol. 24, no. 1
pp. 1 – 15

Abstract

Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models have very many parameters that are tuned on vast amounts of training data. Together, these factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is a cause for concern, especially when the models are applied in the clinical domain, where data are highly sensitive. Training data pseudonymization is a privacy-preserving technique that aims to mitigate these problems: it automatically identifies sensitive entities and replaces them with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks. This study evaluates how end-to-end pseudonymization affects the predictive performance of Swedish clinical BERT models fine-tuned for five clinical NLP tasks. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also show no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that training data can be pseudonymized to reduce privacy risks without harming its utility for training PLMs.
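
The replacement step described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes that sensitive spans have already been found by some entity recognizer and are given as non-overlapping (start, end, label) triples. The labels, the surrogate lists, and the function name pseudonymize are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical surrogate pools; a real system would draw from much larger lists.
SURROGATES = {
    "FIRST_NAME": ["Anna", "Erik", "Maria", "Johan"],
    "CITY": ["Uppsala", "Lund", "Visby"],
}


def pseudonymize(text, spans, rng):
    """Replace each (start, end, label) span with a surrogate of the same label.

    Assumes the spans do not overlap and that every label has a surrogate pool.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])            # keep non-sensitive text as-is
        pieces.append(rng.choice(SURROGATES[label]))  # swap in a surrogate entity
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hand-labelled example standing in for the output of an entity recognizer.
example_text = "Patient Karin was admitted in Kalmar."
name_start = example_text.index("Karin")
city_start = example_text.index("Kalmar")
example_spans = [
    (name_start, name_start + len("Karin"), "FIRST_NAME"),
    (city_start, city_start + len("Kalmar"), "CITY"),
]

result = pseudonymize(example_text, example_spans, random.Random(0))
print(result)
```

The surrounding, non-sensitive text is left untouched, so the pseudonymized note keeps the same structure as the original. Only the identified entities change.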

Published in BMC Medical Informatics and Decision Making

ISSN: 1472-6947 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: http://bmcmedinformdecismak.biomedcentral.com
