A survey on addressing high-class imbalance in big data

Joffrey L. Leevy; Taghi M. Khoshgoftaar; Richard A. Bauder; Naeem Seliya

doi:10.1186/s40537-018-0151-6

Journal of Big Data (Nov 2018)

Affiliations

Joffrey L. Leevy: Florida Atlantic University
Taghi M. Khoshgoftaar: Florida Atlantic University
Richard A. Bauder: Florida Atlantic University
Naeem Seliya: Ohio Northern University

Journal volume & issue: Vol. 5, no. 1, pp. 1–30

Abstract

In a majority–minority classification problem, class imbalance in the dataset(s) can dramatically skew classifier performance, introducing a prediction bias toward the majority class. When the positive (minority) class is the group of interest and the application domain dictates that a false negative is much costlier than a false positive, a bias toward predicting the negative (majority) class can have adverse consequences. With big data, mitigating class imbalance poses an even greater challenge because of the varied and complex structure of the much larger datasets. This paper provides a large survey of studies published within the last 8 years, focusing on high-class imbalance (i.e., a majority-to-minority class ratio between 100:1 and 10,000:1) in big data, in order to assess the state of the art in addressing the adverse effects of class imbalance. Two categories of techniques are covered: Data-Level methods (e.g., data sampling) and Algorithm-Level methods (e.g., cost-sensitive and hybrid/ensemble). Data sampling methods are popular for addressing class imbalance, with Random Over-Sampling methods generally showing better overall results. At the Algorithm-Level, there are some outstanding performers. Yet the published studies report inconsistent and conflicting results and evaluate a limited range of techniques, indicating the need for more comprehensive, comparative studies.
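As a rough illustration of the terms used in the abstract, the sketch below computes a majority-to-minority class ratio and applies a basic form of Random Over-Sampling to toy data. It is not taken from the surveyed studies: the function names `imbalance_ratio` and `random_over_sample`, the toy dataset, and the choice to balance the classes fully are all assumptions made for this example.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-to-minority class count ratio."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_over_sample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size.

    Assumes a binary problem; `minority` is the label of the minority class.
    """
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    majority_count = sum(1 for y in labels if y != minority)
    needed = majority_count - len(minority_rows)
    # Sample with replacement from the existing minority rows.
    extra = [rng.choice(minority_rows) for _ in range(needed)]
    return rows + extra, labels + [minority] * len(extra)


# Toy data (hypothetical): 1,000 negatives and 10 positives, a 100:1 ratio,
# which is the lower bound of "high-class imbalance" as defined in the abstract.
rows = [[i] for i in range(1010)]
labels = [0] * 1000 + [1] * 10

balanced_rows, balanced_labels = random_over_sample(rows, labels)
print(imbalance_ratio(labels), imbalance_ratio(balanced_labels))
```

Over-sampling of this kind should normally be applied to the training split only, so that duplicated minority rows do not leak into the evaluation data.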

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com
