binomialRF: interpretable combinatoric efficiency of random forests to identify biomarker interactions

Samir Rachid Zaim; Colleen Kenost; Joanne Berghout; Wesley Chiu; Liam Wilson; Hao Helen Zhang; Yves A. Lussier

doi:10.1186/s12859-020-03718-9

BMC Bioinformatics (Aug 2020)

Affiliations

Samir Rachid Zaim: Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences
Colleen Kenost: Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences
Joanne Berghout: Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences
Wesley Chiu: Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences
Liam Wilson: Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences
Hao Helen Zhang: Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences
Yves A. Lussier: Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences

DOI: https://doi.org/10.1186/s12859-020-03718-9
Journal volume & issue: Vol. 21, no. 1
pp. 1 – 22

Abstract

Background: In this era of data science-driven bioinformatics, machine learning research has focused on feature selection, as users want more interpretation and post-hoc analyses for biomarker detection. However, when a study has more features (i.e., transcripts) than samples (i.e., mice or human samples), biomarker detection faces major statistical challenges because traditional statistical techniques are underpowered in high dimensions. Second- and third-order interactions of these features pose a substantial combinatoric dimensional challenge. In computational biology, random forest (RF) classifiers are widely used due to their flexibility, powerful performance, ability to rank features, and robustness to the "P >> N" high-dimensional limitation that many matrix regression algorithms face. We propose binomialRF, a feature selection technique in RFs that provides an alternative interpretation for features using a correlated binomial distribution and scales efficiently to analyze multiway interactions.

Results: In both simulations and validation studies using datasets from the TCGA and UCI repositories, binomialRF showed computational gains (5 to 300 times faster) while maintaining competitive variable precision and recall in identifying biomarkers' main effects and interactions. In two clinical studies, the binomialRF algorithm prioritized previously published relevant pathological molecular mechanisms (features) with high classification precision and recall using features alone, as well as with their statistical interactions alone.

Conclusion: binomialRF extends previous methods for identifying interpretable features in RFs and brings them together under a correlated binomial distribution to create an efficient hypothesis testing algorithm that identifies biomarkers' main effects and interactions. Preliminary results in simulations demonstrate computational gains while retaining competitive model selection and classification accuracies. Future work will extend this framework to incorporate ontologies that provide pathway-level feature selection from gene expression input data.
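
Illustrative sketch (not the authors' implementation): the abstract describes feature selection as a hypothesis test on a binomial distribution. The Python below shows one simplified way such a test could look. It makes several assumptions that do not come from the abstract: that the tested quantity is how often each feature is chosen as a tree's root split, that the null selection probability is 1 / (number of features), that trees are independent (the paper uses a correlated binomial, which this sketch ignores), and that a Bonferroni correction is applied. The names `binomial_upper_tail`, `select_features`, and the gene labels are hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def select_features(root_counts, n_trees, alpha=0.05):
    """Flag features chosen as the root split more often than chance.

    Assumptions (not from the source): independent trees, a null
    selection probability of 1 / n_features, Bonferroni correction.
    """
    n_features = len(root_counts)
    p_null = 1.0 / n_features
    pvals = {
        name: binomial_upper_tail(count, n_trees, p_null)
        for name, count in root_counts.items()
    }
    threshold = alpha / n_features  # Bonferroni-adjusted significance level
    selected = [name for name, pv in pvals.items() if pv <= threshold]
    return selected, pvals


# Hypothetical counts: how many of 100 trees split on each feature at the root.
root_counts = {"gene_a": 60, "gene_b": 25, "gene_c": 10, "gene_d": 5}
selected, pvals = select_features(root_counts, n_trees=100)
print(selected)
```

In this toy input only the feature selected far more often than the chance rate of 25 out of 100 trees is flagged. A faithful implementation would replace the independent binomial tail with the correlated binomial model named in the abstract.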

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
