IEEE Access (Jan 2025)

Can Synthetic Data Protect Privacy?

  • Gidan Min,
  • Junhyoung Oh

DOI
https://doi.org/10.1109/access.2025.3542266
Journal volume & issue
Vol. 13
pp. 31544 – 31561

Abstract

To systematically evaluate the privacy protection of five synthetic data generation algorithms (Synthpop, CTGAN, RTVAE, TVAE, and DataSynthesizer), this study applied a range of safety metrics. Synthetic data is designed to protect sensitive information while preserving the statistical properties of the original data, but a high degree of similarity can increase the risk of re-identification. Privacy protection was therefore measured using DCR, NNDR, the Identification Risk Indicator, the Inference Risk Indicator, CM3, DUPI, and pMSE. The results showed that Synthpop provided high data utility, but its high similarity to the original data posed significant privacy risks. In contrast, DataSynthesizer and CTGAN balanced utility and privacy effectively and demonstrated superior privacy protection. RTVAE and TVAE maintained a clear distinction from the original data, which enhanced privacy protection, though data utility decreased in some cases. These findings indicate that algorithms should be selected according to specific privacy and utility requirements, and that the trade-off between data utility and privacy protection must be weighed against the characteristics of each dataset.
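
To make two of the distance-based metrics named in the abstract concrete, the following is a minimal sketch of DCR (distance to closest record) and NNDR (nearest neighbor distance ratio) as they are commonly defined. It is not the authors' implementation. It assumes purely numeric, already-scaled records and Euclidean distance, and the function names `dcr` and `nndr` and the toy data are hypothetical. The paper's exact definitions, preprocessing, and aggregation may differ.

```python
import math


def _sorted_distances(record, reference):
    """Euclidean distances from one record to every reference record, ascending."""
    return sorted(math.dist(record, ref) for ref in reference)


def dcr(synthetic, original):
    """Distance to closest record: for each synthetic row, the distance
    to its nearest original row. Values near 0 mean the synthetic row is
    (almost) a copy of a real row."""
    return [_sorted_distances(s, original)[0] for s in synthetic]


def nndr(synthetic, original):
    """Nearest neighbor distance ratio: nearest distance divided by
    second-nearest distance, per synthetic row. Requires at least two
    original rows. When both distances are 0 the ratio is undefined;
    this sketch returns 0.0 (an assumption, treating it as highest risk)."""
    ratios = []
    for s in synthetic:
        d1, d2 = _sorted_distances(s, original)[:2]
        ratios.append(d1 / d2 if d2 > 0 else 0.0)
    return ratios


# Toy data (illustrative only, not from the paper).
original = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
synthetic = [(0.0, 0.0), (6.0, 8.0)]

dcr_values = dcr(synthetic, original)
nndr_values = nndr(synthetic, original)

print(dcr_values)                          # [0.0, 5.0]
print([round(r, 3) for r in nndr_values])  # [0.0, 0.559]
```

In this toy example the first synthetic record exactly duplicates an original record, so both its DCR and NNDR are 0, which is the situation the abstract describes as a re-identification risk. The second record lies 5 units from its nearest original record and is not much closer to it than to the next one, so under these metrics it would be read as less exposed.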

Keywords