IEEE Access (Jan 2025)
Enhancing Gene Mutation Prediction With Sparse Regularized Autoencoders in Lung Cancer Radiomics Analysis
Abstract
Non-small cell lung cancer (NSCLC) remains a major contributor to deaths worldwide each year. A slow, lengthy screening and diagnostic process may lower 5-year survival rates. In cancer precision medicine, radiomics and machine learning are emerging as paradigm shifts that complement the work of cancer experts. However, these solutions inherently suffer from small sample sizes, high-dimensional radiomics features, and class imbalance. As a unified approach to all of these problems, sparse regularized autoencoders with Kullback-Leibler divergence (GSRA-KL) are proposed, which augment radiomics data while inherently reducing dimensionality. GSRA-KL adds a Kullback-Leibler (KL) divergence term to the cost function of generic sparse regularized autoencoders (GSRA) to preserve the input data distribution at the output layer. In data augmentation experiments, the sample size was increased to 430 using an 18F-FDG metastasis radiomics dataset of 43 NSCLC patients with known EGFR mutations. In an empirical quality evaluation using resemblance and utility metrics, the GSRA-KL-based approach surpassed GSRA while performing competitively against state-of-the-art deep learning approaches. To assess the enhancement of downstream EGFR gene mutation prediction, two multilayer perceptron (MLP) models (Model1 and Model2) with the same control setup were trained using augmented data from GSRA and GSRA-KL, respectively. In testing, Model2 exceeded Model1 by 18% in accuracy, 23% in area under the receiver operating characteristic curve (AUC), 30% in precision, 5% in recall, and 17% in F1-score. Given the ablation-like relationship between GSRA-KL and GSRA, this performance improvement was probably caused directly by the higher-quality augmented data of GSRA-KL. Additionally, GSRA-KL demonstrated competitive performance in gene mutation prediction enhancement compared to deep learning techniques, confirming its efficacy. We expect this approach to support the effective identification of NSCLC mutations under small sample sizes, imbalanced class distributions, and high dimensionality, in support of precision medicine treatments and improved 5-year survival.
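To make the cost function described above concrete, the following is a minimal sketch of a sparse autoencoder loss with an extra KL term between input and output. It is an illustration under stated assumptions, not the authors' formulation: the abstract does not give the exact terms, so the mean-squared reconstruction error, the classic KL sparsity penalty on mean hidden activations, the softmax normalization used to turn feature vectors into distributions, and the names `gsra_kl_loss`, `rho`, `beta`, and `lam` are all hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax; used here (as an assumption) to turn
    # each feature vector into a probability distribution.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) along the last axis, with clipping to avoid log(0).
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def gsra_kl_loss(x, x_hat, hidden, rho=0.05, beta=1.0, lam=1.0):
    """Hypothetical GSRA-KL-style cost for a batch.

    x, x_hat : (n_samples, n_features) input and reconstruction
    hidden   : (n_samples, n_hidden) activations in (0, 1)
    rho      : assumed target mean activation (sparsity level)
    beta     : assumed weight of the sparsity penalty
    lam      : assumed weight of the input/output KL term
    """
    # Reconstruction error (assumed mean squared error).
    recon = np.mean((x - x_hat) ** 2)

    # Sparsity penalty: KL between target and observed mean activation.
    rho_hat = np.clip(hidden.mean(axis=0), 1e-6, 1 - 1e-6)
    sparsity = np.sum(
        rho * np.log(rho / rho_hat)
        + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    )

    # Additional term: KL between normalized input and output, averaged
    # over the batch, encouraging the output to keep the input distribution.
    dist = np.mean(kl_divergence(softmax(x), softmax(x_hat)))

    return recon + beta * sparsity + lam * dist
```

Setting `lam=0.0` in this sketch removes the extra term and leaves a plain sparse autoencoder cost, which mirrors the ablation-like comparison between GSRA and GSRA-KL.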
Keywords