IEEE Access (Jan 2020)

Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools

  • Sandro De Paula Mendonca,
  • Yvan Pereira Dos Santos Brito,
  • Carlos Gustavo Resque Dos Santos,
  • Rodrigo Do Amor Divino Lima,
  • Tiago Davi Oliveira De Araujo,
  • Bianchi Serique Meiguins

DOI
https://doi.org/10.1109/ACCESS.2020.2991949
Journal volume & issue
Vol. 8
pp. 82917 – 82928

Abstract

Data generators are applications that produce synthetic datasets, which are useful for testing data analytics applications such as machine learning algorithms and information visualization techniques. Each data generator takes a different approach to generating data. Consequently, each one has functionality gaps that make it unsuitable for some tasks (e.g., no way to create outliers or non-random noise). This paper presents a data generator application that aims to fill relevant gaps scattered across other applications, providing a flexible tool that helps researchers test their techniques exhaustively and in more diverse ways. The proposed system allows users to define and compose known statistical distributions to produce the desired outcome, and it visualizes the behavior of the data in real time so that users can check whether the data has the characteristics needed for effective testing. This paper describes the tool's functionality in detail, explains how to create datasets, and presents a usage scenario that illustrates the data creation process.
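
As a rough illustration of what composing known distributions can mean, the sketch below builds one synthetic column from a Gaussian base, a deterministic sinusoidal term standing in for non-random noise, and a few injected outliers. It is not the tool described in the paper: the function name `generate_column`, its parameters, and all numeric constants are hypothetical choices made only for this example.

```python
import math
import random


def generate_column(n, seed=0, outlier_rate=0.05):
    """Hypothetical sketch (not the paper's tool): compose a Gaussian base
    with a deterministic sinusoidal term and a few injected outliers."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    # Pick which rows will be turned into outliers.
    outlier_idx = set(rng.sample(range(n), int(n * outlier_rate)))
    values = []
    for i in range(n):
        base = rng.gauss(50.0, 5.0)                   # known distribution
        drift = 3.0 * math.sin(2 * math.pi * i / 25)  # non-random "noise"
        value = base + drift
        if i in outlier_idx:
            # Shift the point far away from the bulk of the data.
            value += rng.choice([-1, 1]) * 40.0
        values.append(value)
    return values, sorted(outlier_idx)


values, outliers = generate_column(200)
print(len(values), "rows,", len(outliers), "injected outliers")
```

Returning the outlier positions alongside the values gives a ground truth against which a detection or visualization technique could be checked.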

Keywords