Frontiers in Big Data (Mar 2025)
Impact of imbalanced features on large datasets
Abstract
The exponential growth of image and video data motivates the need for practical real-time content-based searching algorithms. Features play a vital role in identifying objects within images. However, feature-based classification faces a challenge due to uneven class instance distribution. Ideally, each class should have an equal number of instances and features to ensure optimal classifier performance. However, real-world scenarios often exhibit class imbalances. Thus, this article explores the classification framework based on image features, analyzing balanced and imbalanced distributions. Through extensive experimentation, we examine the impact of class imbalance on image classification performance, primarily on large datasets. The comprehensive evaluation shows that all models perform better with balancing compared to using an imbalanced dataset, underscoring the importance of dataset balancing for model accuracy. Distributed Gaussian (D-GA) and Distributed Poisson (D-PO) are found to be the most effective techniques, especially in improving Random Forest (RF) and SVM models. The deep learning experiments also show an improvement as such.
Keywords