Iterative cleaning and learning of big highly-imbalanced fraud data using unsupervised learning

Robert K. L. Kennedy; Zahra Salekshahrezaee; Flavio Villanustre; Taghi M. Khoshgoftaar

doi:10.1186/s40537-023-00750-3

Journal of Big Data (Jun 2023)

Affiliations

Robert K. L. Kennedy: College of Engineering & Computer Science, Florida Atlantic University
Zahra Salekshahrezaee: College of Engineering & Computer Science, Florida Atlantic University
Flavio Villanustre: LexisNexis Business Information Solutions
Taghi M. Khoshgoftaar: College of Engineering & Computer Science, Florida Atlantic University

DOI: https://doi.org/10.1186/s40537-023-00750-3
Journal volume & issue: Vol. 10, no. 1
pp. 1 – 20

Abstract

Fraud datasets often lack consistent and accurate labels and are characterized by high class imbalance, where fraudulent examples are far fewer than normal ones. Machine learning for effective fraud detection is an important task, since fraudulent behavior can have significant financial or health consequences, but it faces significant challenges due to class imbalance and the limited availability of reliable labels. This paper presents an unsupervised fraud detection method that uses an iterative cleaning process for effective fraud detection. We measure our method's performance using a newly created big Medicare fraud dataset and a widely used credit card fraud dataset. Additionally, we detail the process of creating the highly-imbalanced Medicare dataset from multiple publicly available sources, how additional trainable features were added, and how fraudulent labels were assigned for final model performance measurements. The results are compared with two popular unsupervised learners and show that our method outperforms both models on both datasets. Our work achieves a higher area under the precision-recall curve (AUPRC) with relatively few iterations across both domains.

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com
