Journal of Statistical Software (Oct 2019)

Fuzzy Forests: Extending Random Forest Feature Selection for Correlated, High-Dimensional Data

  • Daniel Conn,
  • Tuck Ngun,
  • Gang Li,
  • Christina M. Ramirez

DOI
https://doi.org/10.18637/jss.v091.i09
Journal volume & issue
Vol. 91, no. 9
pp. 1 – 25

Abstract

In this paper we introduce fuzzy forests, a novel machine learning algorithm for ranking the importance of features in high-dimensional classification and regression problems. Fuzzy forests is specifically designed to provide relatively unbiased rankings of variable importance in the presence of highly correlated features, especially when the number of features, p, is much larger than the sample size, n (p ≫ n). We introduce our implementation of fuzzy forests in the R package fuzzyforest. Fuzzy forests works by taking advantage of the network structure between features. First, the features are partitioned into separate modules such that the correlation within modules is high and the correlation between modules is low. The package fuzzyforest allows for easy use of the package WGCNA (weighted gene coexpression network analysis, alternatively known as weighted correlation network analysis) to form modules of features such that the modules are roughly uncorrelated. Recursive feature elimination random forests (RFE-RFs) are then applied to each module separately. From the surviving features, a final group is selected and ranked using one last round of RFE-RFs. This procedure results in a ranked variable importance list whose size is pre-specified by the user. The selected features can then be used to construct a predictive model.
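
The two-stage procedure described in the abstract (screen within each module, then select from the pooled survivors) can be illustrated with a minimal Python sketch. This is not the fuzzyforest implementation: the module assignments are assumed to be given rather than derived with WGCNA, absolute Pearson correlation with the outcome stands in for random forest variable importance, and every function name, parameter, and the simulated data set are hypothetical. Because the stand-in score does not depend on which other features are present, the re-scoring in each round only mirrors the control flow; with a real random forest, importances would change as features are removed.

```python
import random
import statistics


def abs_corr(xs, ys):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def importance(X, y, features):
    """Stand-in importance score (assumption): not random forest importance."""
    return {f: abs_corr(X[f], y) for f in features}


def recursive_elimination(X, y, features, n_keep, drop_fraction=0.25):
    """Drop the lowest-scoring features round by round until n_keep remain.

    Returns the surviving features ranked from most to least important.
    """
    current = list(features)
    while len(current) > n_keep:
        scores = importance(X, y, current)  # re-scored on the current subset
        current.sort(key=scores.get, reverse=True)
        n_drop = max(1, int(len(current) * drop_fraction))
        n_drop = min(n_drop, len(current) - n_keep)
        current = current[: len(current) - n_drop]
    scores = importance(X, y, current)
    return sorted(current, key=scores.get, reverse=True)


def fuzzy_forest_sketch(X, y, modules, keep_fraction=0.25, n_selected=3):
    """Screen each module separately, then select from the pooled survivors."""
    survivors = []
    for members in modules.values():
        n_keep = max(1, int(len(members) * keep_fraction))
        survivors += recursive_elimination(X, y, members, n_keep)
    return recursive_elimination(X, y, survivors, min(n_selected, len(survivors)))


# Simulated data: two modules of correlated features, each driven by its own
# latent factor. Only a0 and b0 enter the outcome directly.
rng = random.Random(0)
n = 300
modules = {"A": [f"a{i}" for i in range(8)], "B": [f"b{i}" for i in range(8)]}
X = {}
for members in modules.values():
    latent = [rng.gauss(0, 1) for _ in range(n)]
    for name in members:
        X[name] = [z + rng.gauss(0, 1) for z in latent]
y = [3 * a - 3 * b + rng.gauss(0, 0.5) for a, b in zip(X["a0"], X["b0"])]

selected = fuzzy_forest_sketch(X, y, modules, keep_fraction=0.25, n_selected=3)
print(selected)
```

Here `keep_fraction` controls how many features survive screening in each module, and `n_selected` plays the role of the user-specified size of the final ranked list.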

Keywords