Improved classification of large imbalanced data sets using rationalized technique: Updated Class Purity Maximization Over_Sampling Technique (UCPMOT)

Sachin S. Patil; Shefali P. Sonavane

doi:10.1186/s40537-017-0108-1

Journal of Big Data (Dec 2017)

Improved classification of large imbalanced data sets using rationalized technique: Updated Class Purity Maximization Over_Sampling Technique (UCPMOT)

Sachin S. Patil,
Shefali P. Sonavane

Affiliations

Sachin S. Patil: Computer Science and Engineering, R.I.T, Rajaramnagar
Shefali P. Sonavane: Information Technology, W.C.E., Vishrambag

DOI: https://doi.org/10.1186/s40537-017-0108-1
Journal volume & issue: Vol. 4, no. 1
pp. 1 – 32

Abstract

Read online

Abstract The huge variety of NoSQL Big Data has tossed a need for new pathways to store, process and analyze it. The quantum of data created is inconceivable along with a mixed breath of unknown veracity and creative visualization. The new trials of frameworks help to find substantial unidentified values from massive data sets. They have added an exceptional dimension to the pre-processing and contextual conversion of the data sets for needful analysis. In addition, handling of ambitious imbalanced data sets has acknowledged an intimation of alarm. Traditional classifiers are unable to discourse the precise need of grouping for such data sets. Over_sampling of the minority classes help to improve the performance. Updated Class Purity Maximization Over_Sampling Technique (UCPMOT) is a rationalized technique proposed to handle imbalanced data sets using exclusive safe-level based synthetic sample creation. It addresses the multi-class problem in alignment to a newly induced method namely lowest versus highest. The projected technique experiments with several data sets from the UCI repository. The underlying bed of mapreduce environment encompasses the distributed processing approach on Apache Hadoop framework. Several classifiers help to authorize the classification results using parameters like F-measure and AUC values. The experimental conclusions quote the dominance of UCPMOT over the benchmarking techniques.

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com

About the journal

Abstract

Keywords