BioData Mining (Sep 2024)

Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data

  • Erika Cantor,
  • Sandra Guauque-Olarte,
  • Roberto León,
  • Steren Chabert,
  • Rodrigo Salas

DOI
https://doi.org/10.1186/s13040-024-00388-8
Journal volume & issue
Vol. 17, no. 1
pp. 1 – 17

Abstract

The use of prior knowledge in the machine learning framework has been considered a potential tool to handle the curse of dimensionality in genetic and genomics data. Although random forest (RF) is a flexible non-parametric approach with several advantages, it can provide poor accuracy in high-dimensional settings, mainly in scenarios with small sample sizes. We propose a knowledge-slanted RF that integrates biological networks as prior knowledge into the model to improve its performance and explainability, exemplifying its use for selecting and identifying relevant genes. Knowledge-slanted RF combines two stages. First, prior knowledge represented by graphs is translated by running a random walk with restart algorithm to determine the relevance of each gene based on its connections and location in a protein-protein interaction network. Then, each relevance is used to modify the probability of drawing a gene as a candidate split-feature in the conventional RF. Experiments on simulated datasets with very small sample sizes $$(n \le 30)$$ suggest that knowledge-slanted RF predicts the outcome with improved precision compared with conventional RF and logistic lasso regression. The knowledge-slanted RF was completed with the introduction of a modified version of the Boruta feature selection algorithm. Finally, knowledge-slanted RF identified more biologically relevant genes, offering a higher level of explainability for users than conventional RF. These findings were corroborated in one real case, identifying genes relevant to calcific aortic valve stenosis.
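
The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy network `ppi`, the seed gene, the restart probability of 0.3, the `floor` weight, and both function names are assumptions made here for illustration, since the abstract does not specify them. The sketch computes a random walk with restart on a small undirected graph, then draws candidate split-features with probability proportional to the resulting relevance scores.

```python
import random


def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return a steady-state visiting probability for every node.

    adjacency: dict mapping each node to a list of its neighbours.
    seeds: nodes the walker restarts from (hypothetical prior-knowledge genes).
    restart: probability of jumping back to a seed at each step (assumed value).
    """
    nodes = sorted(adjacency)
    # Restart distribution: uniform over the seed nodes.
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            neighbours = adjacency[u]
            if not neighbours:
                # Isolated node: the walker stays put, so total mass is conserved.
                nxt[u] += (1.0 - restart) * p[u]
                continue
            share = (1.0 - restart) * p[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p


def draw_candidate_features(weights, mtry, rng, floor=1e-12):
    """Draw up to mtry distinct features, each with probability proportional to its weight.

    floor is an assumed minimum weight so that zero-relevance genes stay drawable.
    """
    pool = {gene: max(w, floor) for gene, w in weights.items()}
    chosen = []
    for _ in range(min(mtry, len(pool))):
        names = list(pool)
        pick = rng.choices(names, weights=[pool[g] for g in names], k=1)[0]
        chosen.append(pick)
        del pool[pick]  # sample without replacement
    return chosen


# Toy protein-protein interaction network (hypothetical gene names).
ppi = {
    "G1": ["G2", "G3"],
    "G2": ["G1", "G3"],
    "G3": ["G1", "G2", "G4"],
    "G4": ["G3", "G5"],
    "G5": ["G4"],
}

relevance = random_walk_with_restart(ppi, seeds={"G1"})
candidates = draw_candidate_features(relevance, mtry=2, rng=random.Random(0))
```

In a conventional RF the candidate split-features at each node are drawn uniformly; here genes close to the seed in the network are drawn more often, which is the sense in which the forest is "slanted" by prior knowledge.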

Keywords