IEEE Access (Jan 2024)
On the Effectiveness of Feature Selection Techniques in the Context of ML-Based Regression Test Prioritization
Abstract
Regression testing is essential for maintaining software functionality in continuous integration (CI) systems, but it becomes increasingly costly as software complexity grows. Machine learning-based Regression Test Prioritization (RTP) techniques prioritize test cases by their likelihood of failure, aiming to detect failures early and optimize resource use. However, the features used to train machine learning (ML) models in the current state of the art vary widely across datasets, highlighting the need for further research to identify effective feature sets for RTP. Moreover, individual feature selection techniques are often biased toward specific features depending on the dataset. In this study, we therefore explored an ensemble technique that combines three ML-based feature selection techniques to identify and refine the key features that improve test case prioritization. These techniques were applied to four tree-based ML models using data from 15 large-scale open-source software projects. Our analysis identified the features most predictive of failures and assessed their impact on RTP. The results show that a refined subset containing only one-third of the original features achieves similar RTP performance, and in some cases up to a 10% improvement. We also empirically evaluated the cost of the three selection methods and report the ML models' performance with the refined feature sets. These findings underscore the potential of integrating advanced feature selection methods into RTP processes.
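The paper's exact selectors and datasets are not reproduced here; as a minimal illustration of the idea of ensembling feature selection techniques, the sketch below (assuming scikit-learn, synthetic data, and three common selectors: mutual information, random-forest importance, and recursive feature elimination) aggregates per-selector feature rankings and keeps one-third of the features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for RTP data: features describing test cases,
# label = whether the test failed. The paper's real datasets differ.
X, y = make_classification(n_samples=300, n_features=12,
                           n_informative=4, random_state=0)

def to_ranks(scores):
    """Convert importance scores to ranks (0 = most important)."""
    order = np.argsort(-scores)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))
    return ranks

# Selector 1: mutual information between each feature and the failure label.
mi_ranks = to_ranks(mutual_info_classif(X, y, random_state=0))

# Selector 2: impurity-based importance from a tree ensemble.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
rf_ranks = to_ranks(rf.feature_importances_)

# Selector 3: recursive feature elimination with a linear model.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1).fit(X, y)
rfe_ranks = rfe.ranking_ - 1  # RFE ranks start at 1; shift to 0-based

# Ensemble: average the three rank vectors, keep the top third of features.
mean_ranks = (mi_ranks + rf_ranks + rfe_ranks) / 3.0
k = X.shape[1] // 3
selected = np.argsort(mean_ranks)[:k]
print("selected feature indices:", sorted(selected.tolist()))
```

Rank averaging is one simple aggregation rule; voting or intersection-based schemes are equally plausible readings of "ensemble" here, and the reduced feature set would then be fed to the tree-based RTP models for evaluation.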
Keywords