BMC Medical Informatics and Decision Making (May 2025)
Identification of relevant features using SEQENS to improve supervised machine learning models predicting AML treatment outcome
Abstract
Background and objective
This study has two main objectives. The first is to evaluate a feature selection methodology based on SEQENS, an algorithm for identifying relevant variables. The second is to validate machine learning models that predict the risk of complications in patients with acute myeloid leukemia (AML) using data available at diagnosis. Predictions are made for three time points: 90 days, six months, and one year post-diagnosis. These objectives represent fundamental steps toward the development of a tool to assist clinicians in therapeutic decision-making and to provide insights into the risk factors associated with AML complications.

Methods
A dataset of 568 patients, including demographic, clinical, genetic (variant allele frequency, VAF), and cytogenetic information, was created by combining data from Hospital 12 de Octubre (Madrid, Spain) and Instituto de Investigación Sanitaria La Fe (Valencia, Spain). Feature selection based on an enhanced version of SEQENS was conducted for each time point. Four classifiers (XGBoost, Multi-Layer Perceptron, Logistic Regression, and Decision Tree) were then compared to assess the impact of feature selection on model performance.

Results
SEQENS identified different relevant features for each prediction horizon, with Age, TP53, −7/7Q, and EZH2 consistently relevant across all time points. The models were evaluated using 5-fold cross-validation, with XGBoost achieving the highest average ROC-AUC scores: 0.81, 0.84, and 0.82 for the 90-day, 6-month, and 1-year predictions, respectively. In general, performance remained stable or improved after applying SEQENS-based feature selection. Evaluation on an external test set of 54 patients yielded ROC-AUC scores of 0.72 (90-day), 0.75 (6-month), and 0.68 (1-year).

Conclusions
The models achieved performance levels that suggest they could serve as therapeutic decision support tools at different times after diagnosis. The selected variables align with the European LeukemiaNet (ELN) 2022 risk classification, and the SEQENS-based feature selection effectively reduced the feature set while maintaining prediction accuracy.
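
As an illustration of the evaluation protocol summarized above (several classifiers compared by 5-fold cross-validated ROC-AUC, with and without a reduced feature set), a minimal sketch in Python with scikit-learn might look like the following. Everything in it is an assumption made for illustration and not the authors' pipeline: the data are synthetic placeholders rather than the AML cohort, `selected` is a hypothetical list of column indices standing in for the output of a feature selector such as SEQENS (which is not implemented here), scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost, and all hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data (NOT the AML cohort). With shuffle=False the
# informative and redundant columns come first (here, columns 0..9).
X, y = make_classification(
    n_samples=568, n_features=40, n_informative=8, n_redundant=2,
    shuffle=False, random_state=0,
)

# Hypothetical indices standing in for a feature selector's output.
selected = np.arange(10)

# Four classifier families; gradient boosting is a stand-in for XGBoost.
models = {
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "mlp": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    ),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Stratified 5-fold split, shuffled because the synthetic rows are ordered.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def mean_auc(model, features):
    """Average ROC-AUC of `model` over the 5 folds for the given columns."""
    return cross_val_score(model, features, y, cv=cv, scoring="roc_auc").mean()


# Compare each model on the full feature set and on the reduced one.
results = {
    name: {"all": mean_auc(model, X), "selected": mean_auc(model, X[:, selected])}
    for name, model in models.items()
}

for name, scores in results.items():
    print(f"{name}: all={scores['all']:.3f} selected={scores['selected']:.3f}")
```

In a real study the feature selection step would typically be fitted inside each training fold (or on a separate split) so that the cross-validated scores are not inflated by information leaking from the held-out fold; the sketch fixes the columns in advance only to keep the example short.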
Keywords