Journal of Information Sciences (Jul 2024)
Performance Assessment of Ensemble-Tree Learning Models on Breast Cancer Dataset
Abstract
Advances in feature extraction enable the collection of prognostic data values that can be used to distinguish between benign and malignant tumours. While single learning models are capable of making predictions, combining weak learners into an ensemble can improve predictive performance. This study evaluates and compares the performance of four selected ensemble-tree machine learning models applied to the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. The dataset was split into a 60% training set and a 40% test set. A Random Forest classifier, an Extremely Randomized Trees classifier, a Gradient Boosting Machine classifier and an Extreme Gradient Boosting classifier were each initialized with 3 weak learners and fit to the training set, with subsequent predictions made on the test set. The evaluation metrics were Accuracy, Area Under the Receiver Operating Characteristic curve (AUROC), Precision-Recall curves and F2 scores, followed by a stratified 5-fold cross-validation procedure. With greater weight given to Precision and Recall, the Extreme Gradient Boosting classifier and the Extremely Randomized Trees classifier performed best, with average accuracies of 0.9386 and 0.9460 respectively. Overall, the Extremely Randomized Trees classifier outperformed the other models with an average F2 score of 0.4232.
Keywords
Breast cancer; Classification models; Tree-based Ensemble; Supervised learning
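As an illustration of the F2 metric named in the abstract, the following minimal sketch computes the general F-beta score from confusion-matrix counts, where beta = 2 weights Recall more heavily than Precision. The function name `fbeta_from_counts` and the counts used are hypothetical and chosen only for illustration; they are not taken from the study, and the sketch makes no claim about how the authors computed their scores.

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from confusion-matrix counts; beta > 1 weights recall above precision."""
    b2 = beta * beta
    # Equivalent to (1 + b2) * P * R / (b2 * P + R), written directly in counts.
    denom = (1 + b2) * tp + b2 * fn + fp
    # Assumed convention: return 0.0 when there are no positive labels or predictions.
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts for illustration only (not results from the study).
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = fbeta_from_counts(tp, fp, fn, beta=1.0)
f2 = fbeta_from_counts(tp, fp, fn, beta=2.0)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f} F2={f2:.4f}")
```

In this example Recall is lower than Precision, so F2 falls below F1: missed malignant cases (false negatives) are penalized more strongly than false alarms, which is the usual motivation for reporting F2 in a diagnostic setting.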