Data reduction techniques for highly imbalanced medicare Big Data

John T. Hancock; Huanjing Wang; Taghi M. Khoshgoftaar; Qianxin Liang

doi:10.1186/s40537-023-00869-3

Journal of Big Data (Jan 2024)

Affiliations

John T. Hancock: College of Engineering and Computer Science, Florida Atlantic University
Huanjing Wang: Ogden College of Science and Engineering, Western Kentucky University
Taghi M. Khoshgoftaar: College of Engineering and Computer Science, Florida Atlantic University
Qianxin Liang: College of Engineering and Computer Science, Florida Atlantic University

Journal volume & issue: Vol. 11, no. 1, pp. 1–41

Abstract

In the domain of Medicare insurance fraud detection, handling imbalanced Big Data and high dimensionality remains a significant challenge. This study assesses the combined efficacy of two data reduction techniques: Random Undersampling (RUS) and a novel ensemble supervised feature selection method. The techniques are applied to optimize Machine Learning models for fraud identification in the classification of highly imbalanced Big Medicare Data. Utilizing two datasets from The Centers for Medicare & Medicaid Services (CMS) labeled by the List of Excluded Individuals/Entities (LEIE), our principal contribution lies in empirically demonstrating that data reduction techniques applied to these datasets significantly improve classification performance. The study employs a systematic experimental design to investigate various scenarios, ranging from using each technique in isolation to employing them in combination. The results indicate that a synergistic application of both techniques outperforms models that utilize all available features and data. Moreover, reducing the number of features leads to more explainable models. Given the enormous financial implications of Medicare fraud, our findings not only offer computational advantages but also significantly enhance the effectiveness of fraud detection systems, thereby having the potential to improve healthcare services.
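As a rough illustration of the first technique named in the abstract, Random Undersampling can be sketched as follows. This is a minimal, generic sketch and not the authors' implementation: the function name `random_undersample`, the `ratio` parameter (majority rows kept per minority row), the label convention (1 = fraudulent, 0 = non-fraudulent), and the toy data are all assumptions made for the example.

```python
import random


def random_undersample(rows, labels, ratio=1.0, seed=0):
    """Randomly discard majority-class rows until roughly `ratio`
    majority rows remain per minority row.

    Assumption: binary labels, where 1 marks the minority (fraud)
    class and 0 marks the majority class.
    """
    rng = random.Random(seed)
    minority = [i for i, lab in enumerate(labels) if lab == 1]
    majority = [i for i, lab in enumerate(labels) if lab == 0]

    # Never request more majority rows than actually exist.
    n_keep = min(len(majority), round(ratio * len(minority)))

    # Keep every minority row plus a random subset of majority rows,
    # preserving the original row order.
    kept = sorted(minority + rng.sample(majority, n_keep))
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: 1000 rows, of which 10 are labeled fraudulent.
rows = [[float(i), float(i % 7)] for i in range(1000)]
labels = [1 if i < 10 else 0 for i in range(1000)]

rows_rus, labels_rus = random_undersample(rows, labels, ratio=1.0)
```

With `ratio=1.0` the reduced sample keeps all minority rows and an equal number of randomly chosen majority rows. Other ratios retain more of the majority class. The ratios actually evaluated in the study are not stated in the abstract.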

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com
