An extension of Synthetic Minority Oversampling Technique based on Kalman filter for imbalanced datasets

Thejas G.S.; Yashas Hariprasad; S.S. Iyengar; N.R. Sunitha; Prajwal Badrinath; Shasank Chennupati

Machine Learning with Applications (Jun 2022)

Affiliations

Thejas G.S.: Tarleton State University, Texas A&M University System, Department of Computer Science and Electrical Engineering, Stephenville, TX, 76401, USA; Corresponding author.
Yashas Hariprasad: Florida International University, Discovery Lab, Knight Foundation School of Computing and Information Sciences, Miami, FL, 33199, USA
S.S. Iyengar: Florida International University, Discovery Lab, Knight Foundation School of Computing and Information Sciences, Miami, FL, 33199, USA
N.R. Sunitha: Siddaganga Institute of Technology, Department of Computer Science and Engineering, Tumakuru, Karnataka, 572103, India
Prajwal Badrinath: Florida International University, Discovery Lab, Knight Foundation School of Computing and Information Sciences, Miami, FL, 33199, USA
Shasank Chennupati: University of North Carolina at Chapel Hill, School of Medicine, Chapel Hill, NC, 27599, USA

Journal volume & issue: Vol. 8, p. 100267

Abstract

More often than not, data collected in real time tends to be imbalanced, i.e., the samples belonging to one class significantly outnumber those of the others. This degrades the performance of the predictor. One of the most notable algorithms for handling such imbalance by generating synthetic data is the “Synthetic Minority Oversampling Technique” (SMOTE). However, data imbalance is not solely responsible for the poor performance of a classifier. Prior research has demonstrated that noisy samples can play a significant role in misclassification. Moreover, handling large datasets is computationally expensive, so data reduction is imperative. In this work, we put forth a novel extension of SMOTE that integrates it with the Kalman filter. The proposed method, Kalman-SMOTE (KSMOTE), filters out noisy samples from the dataset obtained after SMOTE, which includes both the raw data and the synthetically generated samples, thereby reducing the size of the dataset. Our model is validated on a wide range of datasets. An experimental analysis of the results shows that our model outperforms the presently available techniques.
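
For readers unfamiliar with the oversampling step the abstract builds on, the following is a minimal sketch of standard SMOTE-style interpolation: a synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote_oversample`, its parameters, and the toy data are illustrative assumptions, not taken from the paper. The abstract does not specify how the Kalman filter is applied to remove noisy samples, so that stage is deliberately not reproduced here.

```python
import math
import random


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority samples.

    Hypothetical helper: a plain SMOTE-style sketch, not the KSMOTE method.
    """
    rng = random.Random(seed)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(n)
        base = minority[i]
        # k nearest minority neighbours of the base sample (Euclidean distance)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # random position on the segment from base to neighbour
        gap = rng.random()
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (made-up values for illustration)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_oversample(minority, n_synthetic=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the output stays inside the region the minority class already occupies. According to the abstract, KSMOTE then filters the combined raw and synthetic data for noisy samples.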

Published in Machine Learning with Applications

ISSN: 2666-8270 (Online)
Publisher: Elsevier
Country of publisher: United Kingdom
LCC subjects: Science: Science (General): Cybernetics; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.journals.elsevier.com/machine-learning-with-applications
