Unsupervised label generation for severely imbalanced fraud data

Mary Anne Walauskis; Taghi M. Khoshgoftaar

doi:10.1186/s40537-025-01120-x

Journal of Big Data (Mar 2025)

Affiliations

Mary Anne Walauskis: College of Engineering and Computer Science, Florida Atlantic University
Taghi M. Khoshgoftaar: College of Engineering and Computer Science, Florida Atlantic University

DOI: https://doi.org/10.1186/s40537-025-01120-x
Journal volume & issue: Vol. 12, no. 1, pp. 1–23

Abstract

Many datasets remain unlabeled because obtaining labeled data for machine learning is frequently expensive and requires a high level of domain expertise. Class imbalance is a second challenge facing machine learning practitioners. In domains such as fraud detection the imbalance is severe, as seen in the Credit Card Fraud and Medicare Part D claims datasets used in this work. Our novel binary labeling method automates the labeling process with minimal expert input by combining an ensemble unsupervised method with a percentile thresholding technique. The labels are further refined through an iterative minimization process that selects only the highest-confidence instances to receive a final label of fraudulent. Our labeling approach generates labels for severely imbalanced data, marking each instance as fraudulent or not, in an entirely unsupervised framework. Additionally, and in contrast to conventional methods, our methodology provides a more efficient evaluation: it assesses the efficacy of the generated labels directly, without training a supervised classifier on them. To examine the effect on label efficacy, we report results across a range of positive instance levels for each dataset. The quality of the newly generated class labels is assessed using three evaluation metrics: Jaccard Index (JI), Precision, and Matthews Correlation Coefficient (MCC). Our empirical results demonstrate that our approach consistently outperforms the baseline, Isolation Forest (IF), for all positive instance levels and metrics. Our methodology provides accurate and robust labels under severe class imbalance, which could lead to better machine learning applications in highly imbalanced domains and to more efficient evaluation of newly generated class labels.
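As a rough illustration of two ideas mentioned in the abstract, percentile thresholding of anomaly scores and direct evaluation of the generated labels with JI, Precision, and MCC, the following minimal Python sketch may help. It is not the authors' implementation: the anomaly scores, the ground-truth labels, the 30% threshold, and all function names are hypothetical, and the ensemble and iterative refinement steps are not reproduced.

```python
import math

def label_top_percent(scores, top_percent):
    """Label the top `top_percent` percent of instances (by anomaly score) as 1."""
    n_pos = max(1, round(len(scores) * top_percent / 100))
    # Rank instance indices from most to least anomalous.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = [0] * len(scores)
    for i in ranked[:n_pos]:
        labels[i] = 1
    return labels

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def jaccard_index(tp, fp, fn):
    # Overlap between true and generated positives: TP / (TP + FP + FN)
    return tp / (tp + fp + fn)

def precision(tp, fp):
    # Fraction of generated positives that are truly positive
    return tp / (tp + fp)

def matthews_corrcoef(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical anomaly scores (higher = more anomalous) and ground truth,
# used only to exercise the functions; not data from the paper.
scores = [0.10, 0.90, 0.20, 0.15, 0.80, 0.30, 0.05, 0.70, 0.25, 0.12]
y_true = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]

labels = label_top_percent(scores, top_percent=30)  # hypothetical threshold
tp, fp, fn, tn = confusion_counts(y_true, labels)
ji = jaccard_index(tp, fp, fn)
prec = precision(tp, fp)
mcc = matthews_corrcoef(tp, fp, fn, tn)
print(labels, round(ji, 3), round(prec, 3), round(mcc, 3))
```

In this toy run, the generated labels are compared against the known labels directly, with no supervised classifier trained in between, which is the evaluation style the abstract describes. In practice, the scores would come from an unsupervised detector such as Isolation Forest rather than a hand-written list.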

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com
