IEEE Access (Jan 2023)

Analysis of Feature Selection Methods in Software Defect Prediction Models

  • Misbah Ali,
  • Tehseen Mazhar,
  • Tariq Shahzad,
  • Yazeed Yasin Ghadi,
  • Syed Muhammad Mohsin,
  • Syed Muhammad Abrar Akber,
  • Mohammed Ali

DOI
https://doi.org/10.1109/ACCESS.2023.3343249
Journal volume & issue
Vol. 11
pp. 145954 – 145974

Abstract

Read online

Improving software quality by proactively detecting potential defects during development is a major goal of software engineering. Software defect prediction plays a central role in achieving this goal. The power of data analytics and machine learning allows us to focus our efforts where they are needed most. A key factor in the success of software fault prediction is selecting relevant features and reducing data dimensionality. Feature selection methods contribute by filtering out the most critical attributes from a plethora of potential features. These methods have the potential to significantly improve the accuracy and efficiency of fault prediction models. However, the field of feature selection in the context of software fault prediction is vast and constantly evolving, with a variety of techniques and tools available. Based on these considerations, our systematic literature review conducts a comprehensive investigation of feature selection methods used in the context of software fault prediction. The research uses a refined search strategy involving four reputable digital libraries, including IEEE Explore, Science Direct, ACM Digital Library, and Springer Link, to provide a comprehensive and exhaustive review through a rigorous analysis of 49 selected primary studies from 2014. The results highlight several important issues. First, there is a prevalence of filtering and hybrid feature selection methods. Second, single classifiers such as Naïve Bayes, Support Vector Machine, and Decision Tree, as well as ensemble classifiers such as Random Forest, Bagging, and AdaBoost are commonly used. Third, evaluation metrics such as area under the curve, accuracy, and F-measure are commonly used for performance evaluation. Finally, there is a clear preference for tools such as WEKA, MATLAB, and Python. By providing insights into current trends and practices in the field, this study offers valuable guidance to researchers and practitioners to make informed decisions to improve software fault prediction models and contribute to the overall improvement of software quality.

Keywords