A Tabular Variational Auto Encoder-Based Hybrid Model for Imbalanced Data Classification With Feature Selection

Asha Abraham; Habeeb Shaik Mohideen; R. Kayalvizhi

doi:10.1109/ACCESS.2023.3329139

IEEE Access (Jan 2023)

Affiliations

Asha Abraham: Department of Networking and Communications, School of Computing, SRM Institute of Science and Technology, Kattankulathur, Chennai, India
Habeeb Shaik Mohideen: Department of Genetic Engineering, College of Engineering and Technology, SRM Institute of Science and Technology, Kattankulathur, Chennai, India
R. Kayalvizhi: Department of Networking and Communications, School of Computing, SRM Institute of Science and Technology, Kattankulathur, Chennai, India

Journal volume & issue: Vol. 11
pp. 122760 – 122771

Abstract

Cancer is among the deadliest diseases affecting humankind. Ovarian Cancer (OC) is a major female-specific cancer, and Epithelial Ovarian Cancer (EOC) is its most commonly occurring subtype. The disease is usually identified at a late stage because symptoms go unnoticed in the early stages. Gene expression experiments and machine learning (ML) methodologies can support preventive care for OC by identifying malignant gene transformations earlier and by enabling precision medicine that aids fast recovery. The proposed hybrid Tabular Variational Auto Encoder oriented dictionary based Stratified K Fold Cross Validation (TVAE_dict_SKCV) model addresses this problem. The main objective is to assess the significance of EOC screening variables for categorizing high-risk patients. The model first generated synthetic data with TVAE to enlarge the EOC subtype dataset taken from the Cancer Cell Line Encyclopedia. The synthesized data were balanced using the Synthetic Minority Oversampling Technique (SMOTE). Significant features were selected with the Boruta Feature Selection method. The hyperparameters were fine-tuned with the Optuna optimizer, and the enhanced SKCV was applied with a Random Forest classifier. The TVAE_dict_SKCV method with Boruta achieved an accuracy of 98.5% and outperformed both the experiment with Lasso Feature Selection and the experiment on the original data. SHapley Additive exPlanations (SHAP) summarize the main features driving the classification. Optuna reduced the computing time compared with the Grid Search Cross Validation optimizer.
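Two of the steps named in the abstract, class balancing by minority oversampling and stratified k-fold splitting, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names `smote_oversample` and `stratified_kfold`, the toy data, and all parameter values are assumptions made for illustration. The oversampler is a simplified SMOTE-style interpolation, the splitter is plain stratified k-fold rather than the dictionary-based variant the paper proposes, and the TVAE, Boruta, Optuna, and Random Forest stages are left out.

```python
import random
from collections import defaultdict


def smote_oversample(X, y, k=3, seed=0):
    """Simplified SMOTE-style balancing (illustrative, not the paper's code).

    Each synthetic row is a random point on the segment between a minority
    sample and one of its k nearest same-class neighbours.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        if len(rows) < 2:
            continue  # interpolation needs two samples; such a class is left as-is
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            # Nearest same-class neighbours by squared Euclidean distance.
            neighbours = sorted(
                (r for r in rows if r is not base),
                key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()
            X_out.append([a + gap * (b - a) for a, b in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


def stratified_kfold(y, n_splits=5, seed=0):
    """Yield (train, test) index lists that keep class proportions per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    for i in range(n_splits):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


# Toy data (hypothetical): six majority rows and two minority rows.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
     [0.2, 0.4], [0.4, 0.2], [5.0, 5.0], [5.2, 4.9]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = smote_oversample(X, y, k=1)
splits = list(stratified_kfold(y_bal, n_splits=3))
print("class counts after balancing:", y_bal.count(0), y_bal.count(1))
```

In a pipeline like the one described, the classifier would be trained on each `train` index list and scored on the matching `test` list. Oversampling before splitting, as done here to mirror the order given in the abstract, can place synthetic points derived from a test row into the training folds, so oversampling inside each training fold is a common alternative.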

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
