Validating Seed Data Samples for Synthetic Identities – Methodology and Uniqueness Metrics

Viktor Varkarakis; Shabab Bazrafkan; Gabriel Costache; Peter Corcoran

doi:10.1109/ACCESS.2020.3016097

IEEE Access (Jan 2020)

Affiliations

Viktor Varkarakis: Department of Electronic Engineering, College of Science and Engineering, National University of Ireland Galway, Galway, Ireland
Shabab Bazrafkan: Department of Physics, Imec Vision Laboratory, University of Antwerp, Antwerp, Belgium
Gabriel Costache: Xperi, Galway, Ireland
Peter Corcoran: Department of Electronic Engineering, College of Science and Engineering, National University of Ireland Galway, Galway, Ireland

DOI: https://doi.org/10.1109/ACCESS.2020.3016097
Journal volume & issue: Vol. 8
pp. 152532 – 152550

Abstract

This work explores the identity attribute of synthetic face samples derived from Generative Adversarial Networks (GANs). The goal is to determine whether individual samples are unique in terms of identity, first with respect to the seed dataset used to train the GAN model and second with respect to other synthetic face samples. Two approaches are introduced to enable the comparative analysis of large sets of synthetic face samples. The first uses ROC curves to determine identity uniqueness, with a number of large publicly available datasets of real facial samples providing reference ROCs as a baseline. The second uses a thresholding technique, again with large publicly available datasets as a reference. For this approach, new metrics are introduced, and a technique is provided to remove the most connected data samples within a large synthetic dataset. The remaining synthetic samples can be considered as unique as data samples gathered from different real individuals. Several StyleGAN models are used to create the synthetic datasets, and variations in key model parameters are explored. It is concluded that the resulting synthetic data samples exhibit excellent uniqueness when compared with the original training dataset, but significantly less uniqueness when compared with one another within the synthetic dataset. Nevertheless, it is possible to remove the most highly connected synthetic data samples. Thus, in some cases, up to 92% of the data samples in a 20k synthetic dataset can be shown to exhibit uniqueness similar to that of data samples taken from real public datasets.
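
The abstract does not spell out how the most connected samples are removed, so the following is only an illustrative sketch of one plausible reading, not the authors' procedure. It assumes a precomputed pairwise similarity matrix (for example, from a face-recognition embedding) and treats two samples as "connected" when their similarity exceeds a threshold. The function name `prune_most_connected`, the `sim` matrix, and the threshold value are all hypothetical.

```python
def prune_most_connected(sim, threshold):
    """Greedily drop the sample with the most above-threshold matches
    until no two remaining samples exceed the threshold.

    sim: square, symmetric list of lists of pairwise similarity scores
         (hypothetical input; higher means more alike in identity).
    Returns the sorted indices of the samples that are kept.
    """
    n = len(sim)
    # Two samples are "connected" if their similarity exceeds the threshold.
    neighbours = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] > threshold:
                neighbours[i].add(j)
                neighbours[j].add(i)

    kept = set(range(n))
    while True:
        # Most connected remaining sample; ties go to the lowest index.
        worst = max(sorted(kept), key=lambda i: len(neighbours[i]), default=None)
        if worst is None or not neighbours[worst]:
            break  # no connections left among the kept samples
        for j in neighbours[worst]:
            neighbours[j].discard(worst)
        neighbours[worst].clear()
        kept.discard(worst)
    return sorted(kept)


# Toy example: sample 0 resembles samples 1 and 2; sample 3 is isolated.
sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.2, 0.1],
    [0.8, 0.2, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
print(prune_most_connected(sim, threshold=0.5))  # [1, 2, 3]
```

Removing the single most connected sample first tends to preserve more of the dataset than removing both members of every matching pair, which is in the spirit of the retention figure quoted in the abstract; the actual metrics and removal order used in the paper may differ.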

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
