Evaluating Cluster-Based Synthetic Data Generation for Blood-Transfusion Analysis

Shannon K. S. Kroes; Matthijs van Leeuwen; Rolf H. H. Groenwold; Mart P. Janssen

doi:10.3390/jcp3040040

Journal of Cybersecurity and Privacy (Dec 2023)

Affiliations

Shannon K. S. Kroes: Netherlands Organisation for Applied Scientific Research (TNO), Anna van Buerenplein 1, 2595 DA The Hague, The Netherlands
Matthijs van Leeuwen: Leiden Institute of Advanced Computer Science, Leiden University, 2333 CA Leiden, The Netherlands
Rolf H. H. Groenwold: Department of Clinical Epidemiology, Leiden University Medical Center, 2333 ZA Leiden, The Netherlands
Mart P. Janssen: Transfusion Technology Assessment Group, Donor Medicine Research Department, Sanquin Research, 1066 CX Amsterdam, The Netherlands

Journal volume & issue: Vol. 3, no. 4, pp. 882–894

Abstract

Synthetic data generation is becoming an increasingly popular approach to making privacy-sensitive data available for analysis. Recently, cluster-based synthetic data generation (CBSDG) has been proposed, which uses explainable and tractable techniques for privacy preservation. Although the algorithm demonstrated promising performance on simulated data, CBSDG has not yet been applied to real, personal data. In this work, a published blood-transfusion analysis is replicated with synthetic data to assess whether CBSDG can reproduce more complex and intricate variable relations than previously evaluated. Data from the Dutch national blood bank, consisting of 250,729 donation records, were used to predict donor hemoglobin (Hb) levels by means of support vector machines (SVMs). Precision scores were equal to the original data results for both male (0.997) and female (0.987) donors; recall was 0.007 higher for male and 0.003 lower for female donors (original estimates 0.739 and 0.637, respectively). The impact of the variables on Hb predictions was similar, as quantified and visualized with Shapley additive explanation values. Opportunities for attribute disclosure were decreased for all but two variables; only the binary variables Deferral Status and Sex could still be inferred. Such inference was also possible for donors who were not used as input for the generator and may result from correlations in the data as opposed to overfitting in the synthetic-data-generation process. The high predictive performance obtained with the synthetic data shows the potential of CBSDG for practical implementation.
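
To make the reported metrics concrete, the following is a minimal sketch of how precision and recall are computed and how the synthetic-data recall values follow from the figures quoted in the abstract. The function name `precision_recall`, the toy labels, and the choice of `1` as the positive class are illustrative assumptions; they are not taken from the authors' code or data.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for one class, from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels, only to exercise the function.
toy_true = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 0, 1]
toy_precision, toy_recall = precision_recall(toy_true, toy_pred)

# Figures quoted in the abstract: original recall and the reported
# difference observed when the model is trained on synthetic data.
original_recall = {"male": 0.739, "female": 0.637}
recall_shift = {"male": 0.007, "female": -0.003}

# Implied synthetic-data recall, derived by simple addition.
synthetic_recall = {
    sex: round(original_recall[sex] + recall_shift[sex], 3)
    for sex in original_recall
}

if __name__ == "__main__":
    print(toy_precision, toy_recall)
    print(synthetic_recall)
```

Under these figures, the implied synthetic-data recall is 0.746 for male donors and 0.634 for female donors, while precision is unchanged at 0.997 and 0.987.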

Published in Journal of Cybersecurity and Privacy

ISSN: 2624-800X (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General)
Website: https://www.mdpi.com/journal/jcp