Impact of imbalanced features on large datasets

Waleed Albattah; Rehan Ullah Khan

doi:10.3389/fdata.2025.1455442

Frontiers in Big Data (Mar 2025)

Journal volume & issue: Vol. 8

Abstract

The exponential growth of image and video data motivates the need for practical, real-time, content-based search algorithms. Features play a vital role in identifying objects within images, but feature-based classification is challenged by the uneven distribution of instances across classes. Ideally, each class would have an equal number of instances and features to ensure optimal classifier performance; real-world datasets, however, are often imbalanced. This article therefore explores a classification framework based on image features and analyzes both balanced and imbalanced distributions. Through extensive experimentation, we examine the impact of class imbalance on image classification performance, primarily on large datasets. The evaluation shows that all models perform better on balanced data than on the imbalanced dataset, underscoring the importance of dataset balancing for model accuracy. Distributed Gaussian (D-GA) and Distributed Poisson (D-PO) are found to be the most effective techniques, especially for improving Random Forest (RF) and SVM models. The deep learning experiments show a similar improvement.
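As a rough illustration of what balancing a dataset means in practice, the sketch below applies plain random oversampling to a toy set of feature vectors. It is not the D-GA or D-PO technique named in the abstract, which is not specified here; the function name `random_oversample`, the toy data, and the class labels are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate minority-class samples until every class matches the largest class.

    Illustrative helper only (hypothetical name); it is not the paper's method.
    """
    rng = random.Random(seed)

    # Group feature vectors by their class label.
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is grown to the size of the largest class.
    target = max(len(rows) for rows in by_class.values())

    out_x, out_y = [], []
    for y, rows in by_class.items():
        # Draw extra samples with replacement from the class's own rows.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy data (assumed): 6 "cat" feature vectors versus 2 "dog" feature vectors.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.9], [1.0]]
y = ["cat"] * 6 + ["dog"] * 2

X_bal, y_bal = random_oversample(X, y)
print("before:", Counter(y))
print("after: ", Counter(y_bal))
```

Oversampling of this kind only repeats existing minority samples, so it should be applied to the training split alone; otherwise duplicated samples can leak into the evaluation data and inflate the measured accuracy.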

Published in Frontiers in Big Data

ISSN: 2624-909X (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: https://www.frontiersin.org/journals/big-data
