IEEE Access (Jan 2024)
Improved Software Effort Estimation Through Machine Learning: Challenges, Applications, and Feature Importance Analysis
Abstract
Effort estimation is a crucial aspect of software development and should be completed before any software project starts. Accurate estimates increase the chances of project success, whereas inaccurate estimates can lead to severe issues. This study systematically reviews the literature on effort estimation models published from 2015 to 2024, identifying 69 relevant studies from various publication venues to compile information on software effort estimation models. The review analyzes the models proposed in the literature and their classification, the metrics used to measure accuracy, the leading model most often applied to effort estimation, and the available benchmark datasets. The search strategy identified 542 relevant articles on software development, cost, effort, prediction, estimation, and modelling techniques. After narrowing these to 194 candidates, we chose 69 articles to gain a comprehensive understanding of machine learning (ML) applications in software effort estimation. We used a scoring system (0 to 5 points) to assess how well each study answered our research questions, which helped identify credible, higher-scoring studies for a comprehensive review aligned with our objectives. The data extraction process rated 63 of the 69 studies (91%) as either highly or somewhat relevant, indicating that the search strategy was successful. The review indicates a growing preference for ML-based models, which appear in 59% of the selected studies, while 17% of the selected studies favor hybrid models to overcome software development challenges. We qualitatively analyzed the literature on software effort estimation across four categories: expert judgment, formal estimation techniques, ML-based techniques, and hybrid techniques. We found that ML-based models are the most frequently used for estimating software effort and currently lead the field.
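The selection figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts stated in the abstract; the stage labels ("retrieved", "shortlisted", and so on) are illustrative names assumed here, not terms taken from the study.

```python
# Counts as stated in the abstract; the stage labels are assumed for illustration.
funnel = [
    ("retrieved", 542),
    ("shortlisted", 194),
    ("selected", 69),
    ("rated relevant", 63),
]

# Percentage of each stage retained from the stage before it.
rates = {}
for (_, prev_count), (name, count) in zip(funnel, funnel[1:]):
    rates[name] = round(100 * count / prev_count, 1)

for name, pct in rates.items():
    print(f"{name}: {pct}% of the previous stage")
```

The last ratio, 63 of 69, rounds to the 91% reported in the abstract.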
This study also explores feature importance and feature selection in machine learning models for Software Effort Estimation (SEE), using popular algorithms, namely Support Vector Machine (SVM), AdaBoost (AB), Gradient Boost (GB), and Random Forest (RF), on six benchmark datasets: CHINA, COCOMO-NASA2, COCOMO, COCOMO81, DESHARNAIS, and KITCHENHAM. We analyze the dataset descriptions and use the ML models to compute feature importance for each dataset, in order to identify the attributes that play a crucial role in SEE.
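To make the idea of feature importance concrete, the following is a minimal sketch and not the study's own pipeline. It swaps in permutation importance (the increase in error when one feature's values are shuffled) on a hand-written least-squares model, so that it runs with the Python standard library alone. The data are synthetic, and the feature names `size`, `team_experience`, and `irrelevant` are hypothetical; they are not columns of the benchmark datasets named above.

```python
import random

def fit_linear(X, y):
    """Ordinary least squares via the normal equations (Gauss-Jordan elimination)."""
    rows = [[1.0] + list(r) for r in X]          # prepend an intercept column
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))   # partial pivoting
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(n):
            if r != c:
                f = A[r][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][n] for i in range(n)]           # [intercept, w1, w2, ...]

def mae(w, X, y):
    """Mean absolute error of the linear model w on (X, y)."""
    preds = [w[0] + sum(wi * xi for wi, xi in zip(w[1:], r)) for r in X]
    return sum(abs(p - t) for p, t in zip(preds, y)) / len(y)

def permutation_importance(w, X, y, n_repeats, rng):
    """Average increase in MAE when each feature column is shuffled in turn."""
    baseline = mae(w, X, y)
    scores = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            col = [r[j] for r in X]
            rng.shuffle(col)
            X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, col)]
            increases.append(mae(w, X_perm, y) - baseline)
        scores.append(sum(increases) / n_repeats)
    return scores

# Synthetic projects (assumption): effort depends strongly on size,
# weakly on team experience, and not at all on the third feature.
rng = random.Random(0)
names = ["size", "team_experience", "irrelevant"]
X = [[rng.uniform(10, 100), rng.uniform(1, 10), rng.uniform(0, 100)]
     for _ in range(200)]
y = [3.0 * s + 2.0 * e + rng.gauss(0, 1) for s, e, _ in X]

# Fit on the first 150 rows, measure importance on the held-out 50.
w = fit_linear(X[:150], y[:150])
scores = permutation_importance(w, X[150:], y[150:], n_repeats=5, rng=rng)
importances = dict(zip(names, scores))
for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```

Tree ensembles such as RF, GB, and AB typically expose their own built-in importance scores. Permutation importance is used here only because it is model-agnostic and easy to implement without external libraries.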
Keywords