IEEE Access (Jan 2023)
A Deep Learning-Based Pipeline for the Generation of Synthetic Tabular Data
Abstract
The recent and rapid progress in Machine Learning (ML) tools and methodologies has paved the way for an accessible market of ML services. In principle, small and medium-sized enterprises, as well as large companies, could act as both providers and consumers of services, resulting in an intense exchange in which a consumer may ask many providers for a service preview based on its particular business case, that is, its data. In practice, however, many potential service consumers are reluctant to release their data when seeking ML services because of privacy or intellectual property concerns. As a consequence, the market of ML services is not as fluid as it could be. An alternative to providing real data when looking for an ML service consists in generating and releasing synthetic data. The synthetic data should 1) allow the service provider to preview an ML service whose performance is predictive of the performance the same service will achieve on the real data; and 2) prevent the disclosure of the real data. In this paper, we propose a data synthesis technique tailored to a family of highly relevant business cases: supervised and unsupervised learning on single-table datasets and relational datasets. Our technique is based on generative deep learning models, and we instantiate it in three variants: standard Variational Autoencoders (VAEs), $\beta$-VAEs, and Introspective VAEs. We experimentally evaluate the three variants to measure the degree to which they meet the two requirements above, using several performance indexes that capture different aspects of the quality of the generated data. The results suggest that data synthesis is a practical answer to the need to decouple ML service providers and consumers and, hence, can favor the emergence of an active and accessible market of ML services.
Keywords