npj Digital Medicine (Jan 2025)

Preserving information while respecting privacy through an information theoretic framework for synthetic health data generation

  • Nadir Sella,
  • Florent Guinot,
  • Nikita Lagrange,
  • Laurent-Philippe Albou,
  • Jonathan Desponds,
  • Hervé Isambert

DOI
https://doi.org/10.1038/s41746-025-01431-6
Journal volume & issue
Vol. 8, no. 1
pp. 1 – 16

Abstract

Generating synthetic data from medical records is a complex task, made harder by patient privacy concerns. In recent years, multiple approaches have been reported for the generation of synthetic data; however, limited attention has been given to jointly evaluating the quality and the privacy of the generated data. Both the quality and the privacy of synthetic data stem from multivariate associations across variables, which cannot be assessed by comparing univariate distributions with the original data. Here, we introduce a novel algorithm (MIIC-SDG) for generating synthetic data from electronic records, based on a multivariate information framework and Bayesian network theory. We also propose a new metric to quantitatively assess the trade-off between the Quality and Privacy Scores (QPS) of synthetic data generation methods. The performance of MIIC-SDG is demonstrated on different clinical datasets, where it compares favorably with state-of-the-art synthetic data generation methods, based on the QPS trade-off between several quality and privacy metrics.
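
The point that univariate comparisons cannot capture multivariate associations can be illustrated with a toy sketch. The example below is not MIIC-SDG and does not compute QPS; the variable names (`smoker`, `disease`) and the 90% agreement rate are hypothetical choices made only for illustration. It builds two associated binary variables, produces a naive "synthetic" copy by shuffling one column, and shows that the marginal counts match exactly while the empirical mutual information collapses.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical "original" records: two strongly associated binary variables.
smoker = [rng.randint(0, 1) for _ in range(2000)]
disease = [s if rng.random() < 0.9 else 1 - s for s in smoker]

# Naive "synthetic" data: keep one column, shuffle the other independently.
# Every univariate distribution is preserved; the association is not.
synthetic_smoker = list(smoker)
synthetic_disease = list(disease)
rng.shuffle(synthetic_disease)

same_marginals = (
    Counter(smoker) == Counter(synthetic_smoker)
    and Counter(disease) == Counter(synthetic_disease)
)
mi_original = mutual_information(smoker, disease)
mi_synthetic = mutual_information(synthetic_smoker, synthetic_disease)

print("identical marginals:", same_marginals)
print("MI original  (bits):", round(mi_original, 3))
print("MI synthetic (bits):", round(mi_synthetic, 3))
```

A check based only on per-variable distributions would rate the shuffled data as a perfect match, which is why a pairwise or higher-order measure such as mutual information is needed to see the difference.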