A data-centric machine learning approach to improve prediction of glioma grades using low-imbalance TCGA data

Raquel Sánchez-Marqués; Vicente García; J. Salvador Sánchez

doi:10.1038/s41598-024-68291-0

Scientific Reports (Jul 2024)

Affiliations

Raquel Sánchez-Marqués: Fundación Estatal, Salud, Infancia y Bienestar Social
Vicente García: Dept. Electrical and Computer Engineering, Instituto de Ingeniería y Tecnología, Universidad Autónoma de Ciudad Juárez
J. Salvador Sánchez: Dept. Computer Languages and Systems, Institute of New Imaging Technologies, Universitat Jaume I

Journal volume & issue: Vol. 14, no. 1
pp. 1 – 16

Abstract

Accurate prediction and grading of gliomas play a crucial role in evaluating brain tumor progression, assessing overall prognosis, and planning treatment. In addition to neuroimaging techniques, molecular biomarkers that can guide diagnosis, prognosis, and prediction of the response to therapy have attracted the interest of researchers, particularly when used together with machine learning and deep learning models. Most of the research in this field has been model-centric, meaning that it has focused on finding better-performing algorithms. In practice, however, improving data quality can also result in a better model. This study investigates a data-centric machine learning approach to determine its potential benefits in predicting glioma grades. We report six performance metrics to provide a complete picture of model performance. Experimental results indicate that standardization and oversampling the minority class increase the prediction performance of four popular machine learning models and two classifier ensembles applied to a low-imbalance data set consisting of clinical factors and molecular biomarkers. The experiments also show that the two classifier ensembles significantly outperform three of the four standard prediction models. Furthermore, we conduct a comprehensive descriptive analysis of the glioma data set to identify relevant statistical characteristics and discover the most informative attributes using four feature ranking algorithms.
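
The two data-centric steps named in the abstract, standardization and enlarging the minority class, can be illustrated with a minimal sketch. The abstract does not say which oversampling method the authors used, so plain random duplication of minority-class rows is an assumption here, and the function names and toy data are hypothetical rather than taken from the paper.

```python
import random
import statistics


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its std."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant
    # columns so we never divide by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until all classes match the largest.

    Assumption: random duplication stands in for whatever oversampling
    method the study actually applied.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: two numeric features; label 1 is the minority class.
X = [[50.0, 1.0], [60.0, 0.0], [40.0, 1.0], [70.0, 0.0], [30.0, 1.0]]
y = [0, 0, 0, 1, 1]

X_std = standardize(X)
X_bal, y_bal = random_oversample(X_std, y)
```

In an actual experiment, the means and standard deviations would normally be computed on the training split only, and oversampling would be applied to the training split alone, so that no information from the test data leaks into the model.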

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
