IEEE Access (Jan 2024)

Privacy Mechanisms and Evaluation Metrics for Synthetic Data Generation: A Systematic Review

  • Pablo A. Osorio-Marulanda,
  • Gorka Epelde,
  • Mikel Hernandez,
  • Imanol Isasa,
  • Nicolas Moreno Reyes,
  • Andoni Beristain Iraola

DOI
https://doi.org/10.1109/ACCESS.2024.3417608
Journal volume & issue
Vol. 12
pp. 88048 – 88074

Abstract

The growth of data publishing, sharing, and mining mechanisms across industry and science has increased the flow of data, making it an important asset that must be protected and managed effectively. To this end, different mechanisms have been used across domains, including Privacy Enhancing Technologies such as Synthetic Data Generation, which aim to protect user-sensitive data and prevent its misuse. Synthetic data has been used not only to augment datasets and balance classes but also in data analysis applications that aim to provide useful insights while preserving the privacy of sensitive data. Still, there is a gap in the conceptual and state-of-the-art understanding of the level of privacy synthetic data generators can provide and how they affect various industries and fields. This systematic review addresses how privacy has been assessed and measured in the framework of synthetic data generation and which metrics have been used to evaluate those mechanisms. We provide an overview of 105 recent studies in this field, selected after a screening process, and identify open directions for future research. The main findings include a high prevalence of differential privacy as a privacy-preserving technique and of privacy budget cost as a trade-off metric, a high percentage of GAN-based model implementations, and a predominance of healthcare applications. Our systematic review covers multiple privacy domains and can be understood as a general framework for privacy measurement applied to Synthetic Data Generation.

Keywords