Scientific Reports (Sep 2024)
Exploring the value of multiple preprocessors and classifiers in constructing models for predicting microsatellite instability status in colorectal cancer
Abstract
Approximately 15% of patients with colorectal cancer (CRC) exhibit a distinct molecular phenotype known as microsatellite instability (MSI). Accurate, non-invasive prediction of MSI status is crucial for reducing costs and guiding clinical treatment strategies. This retrospective study enrolled 307 CRC patients between January 2020 and October 2022. Preoperative computed tomography images and postoperative MSI status were available for analysis. Stratified fivefold cross-validation was used to avoid sampling bias when splitting the data. Feature extraction and model construction were performed as follows. First, inter-/intraclass correlation coefficients and the least absolute shrinkage and selection operator algorithm were used to identify the most predictive feature subset. Subsequently, multiple discriminant models were constructed to explore and optimize the combination of six feature preprocessors (Box-Cox, Yeo-Johnson, Max-Abs, Min–Max, Z-score, and Quantile) and three classifiers (logistic regression, support vector machine, and random forest). The combination with the highest average area under the curve (AUC) in the test set was selected as the radiomics model; a clinical screening model and a combined model were then established using the same processing steps. Finally, the performance of the three models was evaluated using decision and calibration curves. The logistic regression model based on the quantile preprocessor had the highest average AUC among the discriminant models. Additionally, tumor location, clinical N stage, and hypertension were identified as independent clinical predictors of MSI status. In the test set, the clinical screening model demonstrated good predictive performance, with an average AUC of 0.762 (95% confidence interval, 0.635–0.890).
Furthermore, the combined model showed excellent predictive performance (AUC, 0.958; accuracy, 0.899; sensitivity, 0.929), with favorable clinical applicability and calibration. The logistic regression model based on the quantile preprocessor exhibited excellent performance and repeatability; this combination may further reduce the variability of input data and improve model performance for predicting MSI status in CRC.
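The model-selection step described in the abstract (fit each preprocessor on the training folds only, train a classifier, and rank the combinations by average test-fold AUC under stratified fivefold cross-validation) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the software used, so the sketch relies only on the Python standard library. It uses synthetic data with roughly 15% positives, a reduced grid (three of the six preprocessors, logistic regression only), and hypothetical helper names such as `stratified_folds` and `mean_cv_auc`. Feature selection by intraclass correlation and LASSO is omitted.

```python
import bisect
import math
import random
from statistics import mean, pstdev

def stratified_folds(y, k=5, seed=0):
    """Deal sample indices into k folds class by class, preserving class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(y)):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

# Each "fit_*" learns its parameters from one training column and returns a transform.
def fit_zscore(col):
    mu, sd = mean(col), pstdev(col) or 1.0
    return lambda v: (v - mu) / sd

def fit_minmax(col):
    lo, span = min(col), (max(col) - min(col)) or 1.0
    return lambda v: (v - lo) / span

def fit_quantile(col):
    ref = sorted(col)
    return lambda v: bisect.bisect_right(ref, v) / len(ref)  # empirical CDF in [0, 1]

PREPROCESSORS = {"z-score": fit_zscore, "min-max": fit_minmax, "quantile": fit_quantile}

def transform(X_train, X_test, fit):
    """Fit one transform per feature on training rows only, then apply to both sets."""
    fns = [fit(list(col)) for col in zip(*X_train)]
    apply = lambda X: [[f(v) for f, v in zip(fns, row)] for row in X]
    return apply(X_train), apply(X_test)

def fit_logreg(X, y, lr=0.5, epochs=200):
    """Logistic regression by batch gradient descent; returns a linear scoring function."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for row, label in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, row))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() numerically safe
            err = 1.0 / (1.0 + math.exp(-z)) - label
            gb += err
            for j, xj in enumerate(row):
                gw[j] += err * xj
        b -= lr * gb / n
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
    return lambda row: b + sum(wi * xi for wi, xi in zip(w, row))

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def mean_cv_auc(X, y, fit_pre, k=5, seed=0):
    """Average held-out AUC over stratified k-fold cross-validation."""
    aucs = []
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        Xtr, Xte = transform([X[i] for i in train], [X[i] for i in held_out], fit_pre)
        score = fit_logreg(Xtr, [y[i] for i in train])
        aucs.append(auc([score(r) for r in Xte], [y[i] for i in held_out]))
    return mean(aucs)

def make_synthetic(n=200, seed=1):
    """Synthetic stand-in for radiomics features: one skewed, one Gaussian, one noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i % 7 == 0 else 0  # roughly 15% positives
        shift = 1.0 if label else 0.0
        X.append([math.exp(rng.gauss(shift, 1.0)), rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y

X, y = make_synthetic()
results = {name: mean_cv_auc(X, y, fit) for name, fit in PREPROCESSORS.items()}
best = max(results, key=results.get)
print(results, "-> selected:", best)
```

Fitting the preprocessor inside each fold, rather than on the full data set, keeps information from the held-out samples out of the transform. Which preprocessor ranks first here depends entirely on the synthetic data and says nothing about the study's findings.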
Keywords