Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data

Erika Cantor; Sandra Guauque-Olarte; Roberto León; Steren Chabert; Rodrigo Salas

doi:10.1186/s13040-024-00388-8

BioData Mining (Sep 2024)

Affiliations

Erika Cantor: Department of Clinical Epidemiology and Biostatistics, Pontificia Universidad Javeriana
Sandra Guauque-Olarte: Department of Basic Sciences and Oral Medicine, Universidad Nacional de Colombia
Roberto León: Department of Computer Science, Universidad Técnica Federico Santa María
Steren Chabert: School of Biomedical Engineering, Universidad de Valparaiso
Rodrigo Salas: School of Biomedical Engineering, Universidad de Valparaiso

DOI: https://doi.org/10.1186/s13040-024-00388-8
Journal volume & issue: Vol. 17, no. 1
pp. 1 – 17

Abstract

The use of prior knowledge in the machine learning framework has been considered a potential tool to handle the curse of dimensionality in genetic and genomics data. Although random forest (RF) represents a flexible non-parametric approach with several advantages, it can provide poor accuracy in high-dimensional settings, mainly in scenarios with small sample sizes. We propose a knowledge-slanted RF that integrates biological networks as prior knowledge into the model to improve its performance and explainability, exemplifying its use for selecting and identifying relevant genes. Knowledge-slanted RF combines two stages. First, prior knowledge represented by graphs is translated by running a random walk with restart algorithm to determine the relevance of each gene based on its connection and localization on a protein-protein interaction network. Then, each relevance is used to modify the selection probability to draw a gene as a candidate split-feature in the conventional RF. Experiments in simulated datasets with very small sample sizes $(n \le 30)$, comparing knowledge-slanted RF against conventional RF and logistic lasso regression, suggest an improved precision in outcome prediction compared to the other methods. The knowledge-slanted RF was completed with the introduction of a modified version of the Boruta feature selection algorithm. Finally, knowledge-slanted RF identified more biologically relevant genes, offering a higher level of explainability for users than conventional RF. These findings were corroborated in one real case to identify genes relevant to calcific aortic valve stenosis.
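The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy five-gene network, the seed gene, the restart probability of 0.3, the function names `random_walk_with_restart` and `draw_candidate_features`, and the choice to sample candidates without replacement are all assumptions made for illustration.

```python
import numpy as np


def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Stage 1 (sketch): relevance of each gene as the steady-state
    visiting probability of a random walk that restarts at the seed genes."""
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0      # isolated nodes: avoid division by zero
    W = adj / col_sums                 # column-normalised transition matrix
    p0 = np.asarray(seeds, dtype=float)
    p0 = p0 / p0.sum()                 # restart distribution over seed genes
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (W @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


def draw_candidate_features(relevance, mtry, rng):
    """Stage 2 (sketch): draw `mtry` candidate split-features with
    probability proportional to relevance instead of uniformly."""
    probs = relevance / relevance.sum()
    return rng.choice(len(relevance), size=mtry, replace=False, p=probs)


# Hypothetical toy network: 5 genes, symmetric adjacency, gene 0 is the seed.
adj = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
seeds = [1, 0, 0, 0, 0]

relevance = random_walk_with_restart(adj, seeds)
rng = np.random.default_rng(0)
candidates = draw_candidate_features(relevance, mtry=2, rng=rng)
print(relevance.round(3), candidates)
```

In this sketch, genes close to the seed receive higher relevance and are therefore drawn more often as split candidates at a tree node. A conventional RF corresponds to the special case where all selection probabilities are equal.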

Published in BioData Mining

ISSN: 1756-0381 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Mathematics: Analysis
Website: https://biodatamining.biomedcentral.com/
