International Journal of Information Management Data Insights (Nov 2023)

Comparison of tabular synthetic data generation techniques using propensity and cluster log metric

  • Aryan Pathare,
  • Ramchandra Mangrulkar,
  • Kartik Suvarna,
  • Aryan Parekh,
  • Govind Thakur,
  • Aruna Gawade

Journal volume & issue
Vol. 3, no. 2
p. 100177

Abstract

In the 21st century, data is as valuable as gold. Data-centric applications generate vast amounts of data, which businesses can use to pinpoint the sources of problems and to identify connections between what is happening in different areas, departments, and systems. However, having more data is not enough; the data must also be of high quality. For example, acting on unfamiliar evidence, speculative ideas, or observations can waste resources, whereas using high-quality data helps achieve correct results. Synthetic data is data generated artificially by an algorithm. It is used to represent real-world data, build test datasets, validate mathematical models, and, most importantly, train machine learning models. Synthetic data can also be used to preserve data privacy: it is considered a safe way to transfer sensitive data because it produces a transaction database that contains no confidential information. This paper compares tabular synthetic data generation techniques on several types of datasets, viz. balanced datasets, unbalanced datasets, datasets with only numerical attributes, datasets with only categorical attributes, and mixed datasets. The utility of the generated synthetic data is measured using the Propensity score metric and the Cluster-Log metric. The main finding is that the Classification And Regression Tree (CART) model provides the best results for all types of datasets, while Generative Adversarial Networks (GANs) give mediocre results at best. This contradicts the common belief that GANs are the go-to models for producing synthetic data.
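
The abstract names the two utility metrics without defining them. As a rough illustration only, the sketch below follows the formulations commonly used in the synthetic-data utility literature: a propensity mean squared error, and the log of a cluster-level analogue. The exact definitions used in the paper may differ. The function names `pmse` and `cluster_log` are hypothetical, and the propensity scores and cluster assignments are assumed to come from any classifier and any clustering algorithm fitted on the pooled real and synthetic records.

```python
import math
from collections import Counter


def pmse(propensities, synthetic_share):
    """Propensity mean squared error (assumed formulation).

    propensities: predicted probability that each pooled record is synthetic.
    synthetic_share: fraction of pooled records that are synthetic (c).
    A value near 0 means the classifier cannot tell real from synthetic.
    """
    return sum((p - synthetic_share) ** 2 for p in propensities) / len(propensities)


def cluster_log(cluster_ids, is_synthetic):
    """Log of the mean squared gap between each cluster's synthetic share
    and the overall synthetic share (assumed formulation).

    cluster_ids: cluster label of each pooled record.
    is_synthetic: True for synthetic records, False for real ones.
    More negative values indicate that the two sources mix more evenly.
    """
    total = Counter(cluster_ids)
    synth = Counter(c for c, s in zip(cluster_ids, is_synthetic) if s)
    overall = sum(is_synthetic) / len(is_synthetic)
    gap = sum((synth[g] / total[g] - overall) ** 2 for g in total) / len(total)
    # A perfectly even mix gives a gap of 0, whose log is negative infinity.
    return math.log(gap) if gap > 0 else float("-inf")
```

Under these assumed definitions, lower values of both metrics indicate synthetic data that is harder to distinguish from the real data, which is the sense in which one generator can be ranked above another.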

Keywords