BMC Bioinformatics (Mar 2022)

Evaluation of tree-based statistical learning methods for constructing genetic risk scores

  • Michael Lau,
  • Claudia Wigmann,
  • Sara Kress,
  • Tamara Schikowski,
  • Holger Schwender

DOI
https://doi.org/10.1186/s12859-022-04634-w
Journal volume & issue
Vol. 23, no. 1
pp. 1 – 30

Abstract

Background: Genetic risk scores (GRS) summarize genetic features such as single nucleotide polymorphisms (SNPs) in a single statistic with respect to a given trait. GRS are typically built using generalized linear models or their regularized extensions. However, these linear methods are usually unable to incorporate gene-gene interactions or non-linear SNP-response relationships. Tree-based statistical learning methods such as random forests and logic regression may be an alternative to such regularized-regression-based methods and are investigated in this article. Moreover, we consider modifications of random forests and logic regression for the construction of GRS.

Results: In an extensive simulation study and an application to a real data set from a German cohort study, we show that both tree-based approaches can outperform elastic net when constructing GRS for binary traits. In particular, a modification of logic regression called logic bagging achieved comparatively high predictive power as measured by the area under the curve, and comparatively high statistical power. Even when no epistatic interaction effects but only marginal genetic effects were considered, the regularized regression method led to inferior results in most cases.

Conclusions: When constructing GRS, we recommend taking random forests and logic bagging into account, in particular if it can be assumed that possibly unknown epistasis between SNPs is present. To develop the best possible prediction models, extensive joint hyperparameter optimizations should be conducted.

Keywords