Exploring the tradeoff between data privacy and utility with a clinical data analysis use case

Eunyoung Im; Hyeoneui Kim; Hyungbok Lee; Xiaoqian Jiang; Ju Han Kim

doi:10.1186/s12911-024-02545-9

BMC Medical Informatics and Decision Making (May 2024)

Affiliations

Eunyoung Im: College of Nursing, Seoul National University
Hyeoneui Kim: College of Nursing, Seoul National University
Hyungbok Lee: College of Nursing, Seoul National University
Xiaoqian Jiang: School of Biomedical Informatics, UTHealth
Ju Han Kim: Seoul National University Hospital

Journal volume & issue: Vol. 24, no. 1, pp. 1–13

Abstract

Background: Securing adequate data privacy is critical for the productive use of data. De-identification, which involves masking or replacing specific values in a dataset, can damage the dataset’s utility, and finding a reasonable balance between data privacy and utility is not straightforward. Nevertheless, few studies have investigated how de-identification efforts affect data analysis results. This study aimed to demonstrate the effect of different de-identification methods on a dataset’s utility with a clinical analytic use case and to assess the feasibility of finding a workable tradeoff between data privacy and utility.

Methods: Predictive modeling of emergency department length of stay was used as the data analysis use case. A logistic regression model was developed with 1155 patient cases extracted from the clinical data warehouse of an academic medical center located in Seoul, South Korea. Nineteen de-identified datasets were generated from various de-identification configurations using ARX, an open-source software tool for anonymizing sensitive personal data. The variable distributions and prediction results were compared between the de-identified datasets and the original dataset. We examined the association between data privacy and utility to determine whether it is feasible to identify a viable tradeoff between the two.

Results: All 19 de-identification scenarios significantly decreased re-identification risk. Nevertheless, the de-identification processes resulted in record suppression and complete masking of variables used as predictors, thereby compromising dataset utility. A significant correlation was observed only between the re-identification reduction rates and the ARX utility scores.

Conclusions: As the importance of health data analysis increases, so does the need for effective privacy protection methods. While existing guidelines provide a basis for de-identifying datasets, achieving a balance between high privacy and utility is a complex task that requires understanding the data’s intended use and involving input from data users. This approach could help find a suitable compromise between data privacy and utility.
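The relationship between generalization, record suppression, and re-identification risk described in the abstract can be illustrated with a small sketch. This is not the study's pipeline and does not call ARX. It assumes a toy set of quasi-identifiers (age, sex, postal code), a simple risk measure (one divided by the number of records sharing the same quasi-identifier values), and hypothetical helper names such as `generalize` and `suppress_small_classes`. All records are invented for illustration.

```python
from collections import Counter

# Hypothetical toy records (age, sex, postal code); not data from the study.
RECORDS = [
    (34, "F", "03080"),
    (36, "F", "03081"),
    (38, "F", "03087"),
    (52, "M", "06351"),
    (57, "M", "06355"),
    (71, "F", "06359"),
]


def generalize(record, age_band=10, zip_digits=2):
    """Coarsen quasi-identifiers: age becomes a band, postal code is truncated."""
    age, sex, zip_code = record
    low = (age // age_band) * age_band
    masked_zip = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    return (f"{low}-{low + age_band - 1}", sex, masked_zip)


def average_risk(records):
    """Mean over records of 1 / (size of the record's equivalence class)."""
    sizes = Counter(records)
    return sum(1 / sizes[r] for r in records) / len(records)


def highest_risk(records):
    """Risk of the most exposed record: 1 / (smallest equivalence class)."""
    return 1 / min(Counter(records).values())


def suppress_small_classes(records, k):
    """Drop records whose equivalence class has fewer than k members."""
    sizes = Counter(records)
    return [r for r in records if sizes[r] >= k]


original_risk = average_risk(RECORDS)
generalized = [generalize(r) for r in RECORDS]
generalized_risk = average_risk(generalized)
reduction_rate = (original_risk - generalized_risk) / original_risk

# Enforcing a minimum class size of 2 removes records that remain unique.
released = suppress_small_classes(generalized, k=2)
retained_fraction = len(released) / len(RECORDS)

print("average risk, original:   ", original_risk)
print("average risk, generalized:", generalized_risk)
print("risk reduction rate:      ", reduction_rate)
print("highest risk after suppression:", highest_risk(released))
print("fraction of records retained:  ", retained_fraction)
```

In this toy example, generalization lowers the average risk, but one record stays unique and is dropped once a minimum class size is enforced. That loss of records and of detail in predictor variables is the kind of utility cost the Results paragraph reports, although the actual privacy models and utility scores used in the study are those configured in ARX.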

Published in BMC Medical Informatics and Decision Making

ISSN: 1472-6947 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: http://bmcmedinformdecismak.biomedcentral.com
