IEEE Access (Jan 2022)

Tackling Dataset Bias With an Automated Collection of Real-World Samples

  • Vasileios Sevetlidis,
  • George Pavlidis,
  • Spyridon Mouroutsos,
  • Antonios Gasteratos

DOI
https://doi.org/10.1109/ACCESS.2022.3226517
Journal volume & issue
Vol. 10
pp. 126832 – 126844

Abstract

The technological advancements of the early 21st century tilted the scales towards data-driven learning. Modern machine-learning systems therefore rely heavily on data to learn complex models that provide relevant predictions efficiently. Data-driven learning suffers from overfitting, a situation in which the learning process seems to have converged to a model that nevertheless lacks generalization power. One way to counter overfitting is to expand the training dataset with more diverse samples. Typically, this is implemented (particularly in computer vision research, which is the focus of this study) by multiplying the original samples using several transformations. Although this strategy might seem straightforward, it does not affect any preexisting dataset bias because the distribution of the augmented data remains more or less similar to the initial one. Ideally, new samples of unseen data must be found, but the cost of acquiring them individually is high. This study presents a novel pipeline that combines state-of-the-art modules to automatically create new thematic datasets with low bias. The proposed method acquired and allocated more than 880K previously unseen images to produce a data collection that InceptionV3 classified with 72% accuracy, with a performance variance of 0.0008 when tested on similar datasets.

Keywords