Deep Generative Models for Synthetic Data: A Survey

Peter Eigenschink; Thomas Reutterer; Stefan Vamosi; Ralf Vamosi; Chang Sun; Klaudius Kalcher

doi:10.1109/ACCESS.2023.3275134

IEEE Access (Jan 2023)

Affiliations

Peter Eigenschink: Department of Marketing, Vienna University of Economics and Business, Vienna, Austria
Thomas Reutterer: Department of Marketing, Vienna University of Economics and Business, Vienna, Austria
Stefan Vamosi: Department of Marketing, Vienna University of Economics and Business, Vienna, Austria
Ralf Vamosi: Department of Marketing, Vienna University of Economics and Business, Vienna, Austria
Chang Sun: Institute of Data Science, Maastricht University, MD Maastricht, The Netherlands
Klaudius Kalcher: Mostly AI GmbH, Vienna, Austria

Journal volume & issue: Vol. 11, pp. 47304–47320

Abstract

A growing interest in synthetic data has stimulated the development and advancement of a large variety of deep generative models for a wide range of applications. However, as this research has progressed, its streams have become more specialized and disconnected from one another. As a result, models for synthesizing text data for natural language processing can no longer be readily compared to models for synthesizing health records. To mitigate this isolation, we propose a data-driven evaluation framework for generative models for synthetic sequential data, an important and challenging sub-category of synthetic data. The framework is based on five high-level criteria: representativeness, novelty, realism, diversity and coherence of a synthetic data-set relative to the original data-set, regardless of the models’ internal structures. The criteria reflect requirements that different domains impose on synthetic data and allow model users to assess the quality of synthetic data across models. In a critical review of generative models for sequential data, we examine and compare the importance of each performance criterion in numerous domains. We find that realism and coherence are more important for synthetic data in natural language, speech and audio processing tasks, while novelty and representativeness are more important for healthcare and mobility data. We also find that representativeness is often measured using statistical metrics, realism using human judgement, and novelty using privacy tests.
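To make two of these criteria concrete, the following is a minimal sketch of how representativeness and novelty might be scored for sequential data. It is an illustration only, not the framework's actual metrics: the choice of total variation distance over token frequencies as the statistical metric, the exact-copy rate as a crude stand-in for a privacy test, and all function names and toy data are assumptions introduced here.

```python
from collections import Counter


def token_distribution(sequences):
    """Relative frequency of each token across all sequences (assumes at least one token)."""
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def total_variation_distance(original, synthetic):
    """Hypothetical representativeness score: 0 means identical token
    frequencies, 1 means the two data-sets share no tokens."""
    p = token_distribution(original)
    q = token_distribution(synthetic)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in support)


def exact_copy_rate(original, synthetic):
    """Hypothetical novelty check: share of synthetic sequences that are
    verbatim copies of an original sequence (lower means more novel)."""
    seen = {tuple(seq) for seq in original}
    copies = sum(1 for seq in synthetic if tuple(seq) in seen)
    return copies / len(synthetic)


# Toy data-sets, invented purely for illustration.
original = [["a", "b", "c"], ["a", "c"], ["b", "c", "c"]]
synthetic = [["a", "b", "c"], ["c", "a"], ["b", "b", "c"]]

tvd = total_variation_distance(original, synthetic)
copy_rate = exact_copy_rate(original, synthetic)
print(f"token TVD: {tvd:.3f}, exact-copy rate: {copy_rate:.3f}")
```

Both scores depend only on the original and synthetic data-sets, not on the model that produced the synthetic data, which is the sense in which such an evaluation is data-driven. Realism, diversity and coherence are not covered by this sketch.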

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
