IEEE Access (Jan 2024)
Rethinking Deep CNN Training: A Novel Approach for Quality-Aware Dataset Optimization
Abstract
The informativeness of data has long been of great interest to the machine learning community. With the rapid advancement of artificial intelligence and the growing volume of noisy data, robust and effective methods for training data optimization have become even more essential. Existing approaches rely mostly on empirical trial and error, using either stochastic or deterministic data reduction strategies. Their key limitation is that they do not consider the overall informativeness of the resulting training dataset. This paper proposes a novel approach to quality-aware dataset optimization based on an initial assessment of the dataset's informativeness. Entropy values calculated over the target dataset serve as the informativeness metric. To reduce the computational cost, the dataset is first clustered, and the entropy of each cluster is calculated independently. The dataset is then optimized by dynamic programming to find a sequence of subsets with low overall entropy subject to the imposed size limits. The experimental evaluation shows that the proposed approach outperforms the best current alternatives in terms of accuracy, precision, recall, and F1-score. Moreover, the proposed approach provides excellent interclass discrimination even for a large number of classes.
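The pipeline outlined in the abstract (per-cluster entropy followed by a size-constrained dynamic program) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions made only for illustration: the clusters are already given, entropy is taken as the Shannon entropy of the labels inside each cluster, the objective is the plain sum of cluster entropies, and the size limitation is a range `[min_size, max_size]` on the total number of selected samples. The names `shannon_entropy` and `select_clusters` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = len(labels)
    # sum of p * log2(1/p); an empty cluster yields 0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def select_clusters(clusters, min_size, max_size):
    """Choose clusters whose total size lies in [min_size, max_size]
    while minimizing the summed per-cluster entropy.

    Assumption: this 0/1 knapsack-style formulation is an illustrative
    stand-in for the dynamic programming step, not the paper's algorithm.
    Returns (chosen_indices, total_entropy), or None if no selection fits.
    """
    # best[s] = (lowest summed entropy reaching exactly s samples, indices)
    best = {0: (0.0, [])}
    for idx, labels in enumerate(clusters):
        size = len(labels)
        h = shannon_entropy(labels)
        # iterate over a snapshot so each cluster is used at most once
        for s, (cost, chosen) in list(best.items()):
            new_s = s + size
            if new_s > max_size:
                continue
            candidate = (cost + h, chosen + [idx])
            if new_s not in best or candidate[0] < best[new_s][0]:
                best[new_s] = candidate
    feasible = [entry for s, entry in best.items() if s >= min_size]
    if not feasible:
        return None
    cost, chosen = min(feasible, key=lambda entry: entry[0])
    return chosen, cost


# Toy clusters, represented only by the class labels of their samples.
clusters = [
    ["a"] * 4,             # pure cluster, entropy 0
    ["a", "b"] * 2,        # evenly mixed, entropy 1 bit
    ["a", "a", "a", "b"],  # mostly one class
    ["b"] * 3,             # pure cluster, entropy 0
]
result = select_clusters(clusters, min_size=7, max_size=8)
```

On this toy input, the sketch selects the two pure clusters, which together satisfy the size range at zero summed entropy. A size-weighted entropy or a different feasibility constraint would change the selection, so the objective here should be read as one possible choice.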
Keywords