IEEE Access (Jan 2025)
Volatile Organic Compounds for the Prediction of Lung Cancer by Using Ensembled Machine Learning Model and Feature Selection
Abstract
The advancement of biomarkers is critically important at present, as lung cancer is a leading cause of death. In the present study, volatile organic compounds (VOCs) are considered as biomarkers to predict lung cancer. VOCs from seven different sources including breath, blood, urine, cell line, plerual fluid, cancer tissue and lung tissue are targeted to enhance the prediction reliability. Feature selection and models fusion have been focused on during this study. Five in-built and one proposed ensemble machine learning model have been utilised to investigate the different types of VOCs. The idea behind designing one ensemble model is to combine multiple individual models for better performance by using optimal feature sets. This reasoning led to the design of an ensemble model to predict breath VOCs. The AvNNet model has superior performance in predicting blood VOCs, cancer tissue VOCs, cell line VOCs, and urine VOCs compared to four other models, achieving accuracies of 70%, 80%, 70%, and 90% accordingly on the validation dataset. The Blackboost model achieved 90% accuracy on the validation dataset in its prediction of lung tissue VOCs. With 90% accuracy on a validation dataset, the random forest model predicts pleural fluid volatile organic compounds efficiently. When compared to individual models, the proposed ensemble model predicts breath VOCs more effectively and achieves 100% accuracy on the validation dataset.
Keywords