Rethinking Deep CNN Training: A Novel Approach for Quality-Aware Dataset Optimization

Bohdan Rusyn; Oleksiy Lutsyk; Rostyslav Kosarevych; Oleg Kapshii; Oleksandr Karpin; Taras Maksymyuk; Juraj Gazda

doi:10.1109/ACCESS.2024.3414651

IEEE Access (Jan 2024)

Affiliations

Bohdan Rusyn: Department of Remote Sensing Information Technologies, Karpenko Physico-Mechanical Institute, NAS of Ukraine, Lviv, Ukraine
Oleksiy Lutsyk: Department of Remote Sensing Information Technologies, Karpenko Physico-Mechanical Institute, NAS of Ukraine, Lviv, Ukraine
Rostyslav Kosarevych: Department of Remote Sensing Information Technologies, Karpenko Physico-Mechanical Institute, NAS of Ukraine, Lviv, Ukraine
Oleg Kapshii: Advanced Systems Research Group, Infineon Technologies, Lviv, Ukraine
Oleksandr Karpin: Advanced Systems Research Group, Infineon Technologies, Lviv, Ukraine
Taras Maksymyuk: Advanced Systems Research Group, Infineon Technologies, Lviv, Ukraine
Juraj Gazda: Department of Computers and Informatics, Technical University of Kosice, Košice, Slovakia

DOI: https://doi.org/10.1109/ACCESS.2024.3414651
Journal volume & issue: Vol. 12, pp. 137427–137438

Abstract

The informativeness of data has always been of great interest within the machine learning community. Nowadays, with the skyrocketing advancement of artificial intelligence and massive volumes of noisy data, it becomes even more essential to develop robust and effective methods for training data optimization. Existing approaches are mostly based on empirical trial and error, with either stochastic or deterministic data reduction strategies. The key limitation of such solutions is that they do not consider the overall informativeness of the resulting training dataset. In this paper, a novel approach for quality-aware dataset optimization by initial assessment of its informativeness is proposed. As a metric of informativeness, entropy values are calculated over the target dataset. To alleviate the computational complexity, an initial clustering of the dataset is performed, and the entropy of each cluster is calculated independently. The dataset is then optimized by dynamic programming to find a sequence of subsets with low overall entropy according to imposed size limitations. The experimental evaluation shows that the proposed approach improves over current best alternatives in terms of accuracy, precision, recall, and F1-score metrics. Moreover, the proposed approach provides excellent interclass discrimination even for a large number of classes.
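
The abstract outlines a pipeline of per-cluster entropy followed by a dynamic-programming selection under a size limit, but it does not give the exact entropy definition, the clustering method, or the optimization objective. The sketch below is therefore only an illustration under stated assumptions: it uses Shannon entropy of discrete values within a cluster, and a knapsack-style dynamic program that picks clusters whose combined size reaches a minimum while the summed entropy stays as low as possible. The function names `shannon_entropy` and `select_clusters`, and the parameter `min_total`, are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of discrete values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_clusters(sizes, entropies, min_total):
    """Pick clusters with combined size >= min_total and minimal summed entropy.

    Assumption: cluster sizes are positive integers and entropies are
    non-negative. Returns (summed entropy, list of chosen cluster indices);
    the entropy is infinite if the size limit cannot be reached.
    """
    inf = float("inf")
    # best[s] holds the cheapest selection whose size is s (capped at min_total).
    best = [(inf, [])] * (min_total + 1)
    best[0] = (0.0, [])
    for i, (size, h) in enumerate(zip(sizes, entropies)):
        # Walk sizes downward so each cluster is used at most once.
        for s in range(min_total, -1, -1):
            cost, chosen = best[s]
            if cost == inf:
                continue
            t = min(min_total, s + size)
            if cost + h < best[t][0]:
                best[t] = (cost + h, chosen + [i])
    return best[min_total]
```

For example, `select_clusters([3, 4, 2, 5], [1.5, 0.5, 1.0, 2.0], 6)` selects the clusters at indices 1 and 2, which together hold 6 samples. Summing per-cluster entropies is a simplification chosen for this sketch; the paper's notion of overall entropy may be defined differently.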

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
