Generation of Controlled Synthetic Samples and Impact of Hyper-Tuning Parameters to Effectively Classify the Complex Structure of Overlapping Region

Zafar Mahmood; Naveed Anwer Butt; Ghani Ur Rehman; Muhammad Zubair; Muhammad Aslam; Afzal Badshah; Syeda Fizzah Jilani

doi:10.3390/app12168371

Applied Sciences (Aug 2022)

Affiliations

Zafar Mahmood: Department of Computer Science, University of Gujrat, Punjab 50700, Pakistan
Naveed Anwer Butt: Department of Computer Science, University of Gujrat, Punjab 50700, Pakistan
Ghani Ur Rehman: Department of Computer Science & Bioinformatics, Khushal Khan Khattak University, Karak 27000, Pakistan
Muhammad Zubair: Department of Computer Science & Bioinformatics, Khushal Khan Khattak University, Karak 27000, Pakistan
Muhammad Aslam: School of Computing Engineering & Physical Sciences, University of West Scotland, Glasgow G72 0LH, UK
Afzal Badshah: Department of Computer Science & Software Engineering, International Islamic University Islamabad, Islamabad 44000, Pakistan
Syeda Fizzah Jilani: Department of Physics, Aberystwyth University, Aberystwyth SY23 3BZ, UK

DOI: https://doi.org/10.3390/app12168371
Journal volume & issue: Vol. 12, no. 16, p. 8371

Abstract

The classification of imbalanced and overlapping data has received considerable attention over the last decade, as most real-world applications comprise multiple classes with an imbalanced distribution of samples. Samples from different classes overlap near class boundaries, creating a complex structure for the underlying classifier. Due to the imbalanced distribution of samples, the underlying classifier favors samples from the majority class and ignores samples from the smallest minority class. The imbalanced nature of the data, together with the resulting overlapping regions, greatly affects learning, as most machine learning classifiers are designed for balanced datasets and perform poorly when applied to imbalanced data. To improve learning on multi-class problems, more expertise is required in both traditional classifiers and problem domain datasets, along with some experimentation and knowledge of tuning the hyperparameters of the classifier under consideration. Several techniques for learning from multi-class problems have been reported in the literature, such as sampling techniques, algorithm adaptation methods, transformation methods, hybrid methods, and ensemble techniques. In the current research work, we first analyzed the learning behavior of state-of-the-art ensemble and non-ensemble classifiers on imbalanced and overlapping multi-class data. After this analysis, we used grid search to tune the key parameters of the ensemble and non-ensemble classifiers and to determine the optimal set of parameters for learning from a multi-class imbalanced classification problem, using 15 public datasets. After hyper-tuning, synthetic samples amounting to 20% of each dataset are generated and added to the majority class of that dataset to make it more overlapped (a more complex structure). After the synthetic samples are added, the hyper-tuned ensemble and non-ensemble classifiers are tested on that complex structure. This paper also includes a brief description of the tuned parameters and their effects on imbalanced data, followed by a detailed comparison of ensemble and non-ensemble classifiers with default and tuned parameters on both the original and the synthetically overlapped datasets. We believe that this paper is the first effort of its kind in this domain and that it will open up various research directions with a greater focus on classifier parameters in the field of learning from imbalanced data using machine-learning algorithms.
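
The abstract does not say how the synthetic majority-class samples are generated, so the following is only a minimal sketch of one possible approach, not the authors' procedure. It assumes that each synthetic point is a random interpolation between a majority-class sample and its nearest sample from another class, which pushes new majority points toward the class boundary. The function name `add_overlapping_majority_samples`, the `fraction` and `seed` parameters, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def add_overlapping_majority_samples(X, y, fraction=0.20, seed=0):
    """Return copies of X and y with synthetic majority-class points appended.

    Assumption (not from the paper): a synthetic point is a random
    interpolation between a majority sample and its nearest sample from
    any other class, so it falls in or near the overlapping region.
    """
    rng = random.Random(seed)
    majority = Counter(y).most_common(1)[0][0]
    majority_points = [x for x, label in zip(X, y) if label == majority]
    other_points = [x for x, label in zip(X, y) if label != majority]
    # Number of synthetic samples: a fraction of the whole dataset size.
    n_new = round(fraction * len(X))

    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    new_X, new_y = list(X), list(y)
    for _ in range(n_new):
        base = rng.choice(majority_points)
        neighbor = min(other_points, key=lambda p: squared_distance(base, p))
        t = rng.random()  # position along the segment from base to neighbor
        synthetic = [b + t * (n - b) for b, n in zip(base, neighbor)]
        new_X.append(synthetic)
        new_y.append(majority)  # labeled as majority to increase overlap
    return new_X, new_y


# Toy three-class dataset: "a" is the majority class.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.2],
     [1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [2.0, 0.0]]
y = ["a"] * 6 + ["b"] * 3 + ["c"]

X_aug, y_aug = add_overlapping_majority_samples(X, y)
print(len(X), "->", len(X_aug))
print(Counter(y_aug))
```

In an experiment of the kind the abstract describes, the tuned classifiers would then be evaluated on the augmented data (`X_aug`, `y_aug`) and compared against their results on the original data.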

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
