A Methodology and an Empirical Analysis to Determine the Most Suitable Synthetic Data Generator

A. Kiran; S. Saravana Kumar

doi:10.1109/ACCESS.2024.3354277

IEEE Access (Jan 2024)

Affiliations

A. Kiran: Department of Computer Science and Engineering (CSE), School of Engineering and Technology (SOET), CMR University, Bengaluru, Karnataka, India
S. Saravana Kumar: Department of Computer Science and Engineering (CSE), School of Engineering and Technology (SOET), CMR University, Bengaluru, Karnataka, India

DOI: https://doi.org/10.1109/ACCESS.2024.3354277
Journal volume & issue: Vol. 12, pp. 12209–12228

Abstract

According to a report published by Gartner in 2021, a significant portion of Machine Learning (ML) training data will be artificially generated. This development has led to the emergence of various synthetic data generators (SDGs), particularly those based on Generative Adversarial Networks (GANs). Research so far has been exploratory, focusing on specific objectives such as validating utility or disclosure control, or assessing how generators can decrease or increase inherent bias with differential privacy. Hence, we aim to empirically identify an AI-based data generator that produces datasets closely resembling real datasets, while also determining the hyper-parameters that enable a satisfactory balance between utility, privacy, and fairness in the datasets. To achieve this, we used three synthetic data generation packages accessible via Python: Synthetic Data Vault (SDV), Data Synthesizer, and Smartnoise-synth (SN-synth). The data generation models available within these packages were each presented with 13 tabular datasets as sample inputs. We generated synthetic data for every combination of dataset and generator and investigated the goodness of each generator using five hypothetical scenarios. The utility and privacy offered by the generated data were compared with those of the real data. The fairness of the ML model trained with synthetic data was used as a third evaluation metric. Finally, we used synthetic data to train regression and classification ML algorithms and evaluated their performance. After conducting experiments, analyzing metrics, and comparing ML scores across all 11 generators, we determined that CTGAN from SDV and PATECTGAN from the SN-synth package were the most effective in mimicking real data for all 13 datasets used in our research.
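
The comparison loop described in the abstract (every generator applied to every dataset, then ranked by a score) can be illustrated with a minimal sketch. This is not the authors' code and does not call SDV, Data Synthesizer, or Smartnoise-synth: the two toy generators, the `utility` score, and all function names below are hypothetical stand-ins chosen so the example runs with the Python standard library alone. It covers only a utility-style score; the privacy and fairness metrics mentioned in the abstract are not modeled.

```python
import random
import statistics


def make_linear_dataset(n, slope, rng):
    """Build a toy two-column table in which y depends strongly on x."""
    return [(x, slope * x + rng.gauss(0, 1)) for x in range(n)]


def bootstrap_generator(rows, n, rng):
    """Toy generator (assumption): resample whole real rows."""
    return [rng.choice(rows) for _ in range(n)]


def independent_generator(rows, n, rng):
    """Toy generator (assumption): sample each column independently,
    which preserves marginals but destroys cross-column correlation."""
    columns = list(zip(*rows))
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def utility(real, synth):
    """Hypothetical utility score in [0, 1]: 1 minus half the absolute
    gap between the real and synthetic column correlations."""
    gap = abs(pearson(*zip(*real)) - pearson(*zip(*synth)))
    return 1 - gap / 2


def rank_generators(datasets, generators, seed=0):
    """Score every generator on every dataset; rank by mean score."""
    rng = random.Random(seed)
    scores = {name: [] for name in generators}
    for rows in datasets.values():
        for name, generate in generators.items():
            synth = generate(rows, len(rows), rng)
            scores[name].append(utility(rows, synth))
    return sorted(
        ((statistics.fmean(vals), name) for name, vals in scores.items()),
        reverse=True,
    )


# Three toy datasets stand in for the paper's 13 tabular datasets.
data_rng = random.Random(0)
datasets = {
    f"toy_slope_{slope}": make_linear_dataset(200, slope, data_rng)
    for slope in (1, 2, 3)
}
generators = {
    "bootstrap": bootstrap_generator,
    "independent": independent_generator,
}
ranking = rank_generators(datasets, generators)
for mean_score, name in ranking:
    print(f"{name}: {mean_score:.3f}")
```

In a real study the toy generators would be replaced by the package models under test, and the single correlation-gap score by the separate utility, privacy, and fairness metrics; the nested loop and the final ranking step are the only parts meant to carry over.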

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
