Applied Sciences (Jul 2024)
Challenges of Using Synthetic Data Generation Methods for Tabular Microdata
Abstract
The generation of synthetic data holds significant promise for augmenting limited datasets while avoiding privacy issues, facilitating research, and enhancing machine learning models’ robustness. Generative Adversarial Networks (GANs) stand out as promising tools, employing two neural networks—generator and discriminator—to produce synthetic data that mirrors real data distributions. This study evaluates GAN variants (CTGAN, CopulaGAN), a variational autoencoder, and copulas on diverse real datasets of different complexity encompassing numerical and categorical attributes. The results highlight CTGAN’s sensitivity to training parameters and TVAE’s robustness across datasets. Scalability challenges persist, with GANs demanding substantial computational resources. TVAE stands out for its high utility across all datasets, even for high-dimensional data, though it incurs higher privacy risks, which is indicative of the curse of dimensionality. While no single model universally excels, understanding the trade-offs and leveraging model strengths can significantly enhance synthetic data generation (SDG). Future research should focus on adaptive learning mechanisms, scalability enhancements, and standardized evaluation metrics to advance SDG methods effectively. Addressing these challenges will foster broader adoption and application of synthetic data.
Keywords