Engineering and Applied Science Research (Nov 2023)

Correlation-based feature selection and Smote-Tomek Link to improve the performance of machine learning methods on cancer disease prediction

  • Lalu Ganda Rady Putra,
  • Khairani Marzuki,
  • Hairani Hairani

Journal volume & issue
Vol. 50, no. 6
pp. 577 – 583

Abstract

Indonesia is an archipelago with a population of 283 million, the fourth largest in the world. In Indonesia, breast cancer is the most common cancer and the largest contributor to cancer deaths. Deaths caused by breast cancer can be reduced through screening and early detection, which lower the risk of the cancer progressing to a more severe stage. Early detection can delay the growth of cancer cells and increase the chances of recovery. This research proposed a machine learning-based application that lets users screen for breast cancer independently, based on perceived symptoms. Such an application requires very high accuracy to minimize prediction errors, so this research focused on finding the machine learning method that predicts breast cancer with the lowest error rate. The aim was to improve the performance of classification methods in breast cancer prediction using correlation feature selection and Smote-Tomek Link hybrid sampling. Two classification methods, Support Vector Machine (SVM) and Naive Bayes, were each combined with Smote-Tomek Link hybrid sampling and correlation feature selection. Smote-Tomek Link hybrid sampling balanced the data while minimizing noise in the synthetic data it created. Correlation feature selection was used to select the input attributes that are relevant to the class attribute, keeping those with a strong correlation (≥ 0.6) between input attribute and class. The SVM method with hybrid sampling and correlation feature selection achieved the best performance, outperforming both the Naive Bayes method and the previous research referred to, with an accuracy of 96.80%, sensitivity of 96.80%, and specificity of 96.80%. Thus, the Smote-Tomek Link hybrid sampling approach and correlation feature selection improved the performance of the SVM and Naive Bayes methods for breast cancer prediction.
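
The correlation threshold and the reported metrics can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the correlation is Pearson's coefficient taken in absolute value, the attribute names and values are hypothetical toy data, and the Smote-Tomek Link and classifier steps are left out.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)


def select_features(columns, labels, threshold=0.6):
    """Keep attributes whose |correlation| with the class is >= threshold.

    Using the absolute value is an assumption; the abstract only states
    a strong correlation level of >= 0.6.
    """
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, labels)) >= threshold
    ]


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: 1 = cancer, 0 = no cancer.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "lump_size": [1, 2, 2, 5, 6, 7],    # rises with the class label
    "age": [30, 52, 41, 47, 35, 44],    # nearly unrelated to the class label
}

selected = select_features(columns, labels, threshold=0.6)
print("Selected attributes:", selected)
```

In a full pipeline, the selected attributes would be rebalanced with Smote-Tomek Link on the training split only and then passed to the SVM or Naive Bayes classifier, with `sensitivity_specificity` applied to predictions on held-out data.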

Keywords