Fuzzy Forests: Extending Random Forest Feature Selection for Correlated, High-Dimensional Data

Daniel Conn; Tuck Ngun; Gang Li; Christina M. Ramirez

doi:10.18637/jss.v091.i09

Journal of Statistical Software (Oct 2019)

DOI: https://doi.org/10.18637/jss.v091.i09
Journal volume & issue: Vol. 91, no. 9
pp. 1 – 25

Abstract

In this paper we introduce fuzzy forests, a novel machine learning algorithm for ranking the importance of features in high-dimensional classification and regression problems. Fuzzy forests is specifically designed to provide relatively unbiased rankings of variable importance in the presence of highly correlated features, especially when the number of features, p, is much larger than the sample size, n (p ≫ n). We introduce our implementation of fuzzy forests in the R package fuzzyforest. Fuzzy forests works by taking advantage of the network structure among features. First, the features are partitioned into separate modules such that the correlation within modules is high and the correlation between modules is low. The package fuzzyforest allows for easy use of the package WGCNA (weighted gene coexpression network analysis, alternatively known as weighted correlation network analysis) to form modules of features such that the modules are roughly uncorrelated. Then recursive feature elimination random forests (RFE-RFs) are applied to each module separately. From the surviving features, a final group is selected and ranked using one last round of RFE-RFs. This procedure yields a ranked variable importance list whose size is pre-specified by the user. The selected features can then be used to construct a predictive model.
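
The three-stage control flow described in the abstract (partition into modules, eliminate within each module, then select and rank the survivors) can be sketched in a few lines. This is only an illustration, not the fuzzyforest package or its API: every function name and parameter below is hypothetical, a simple correlation threshold stands in for WGCNA, and absolute correlation with the response stands in for random forest variable importance so that the example needs nothing beyond the Python standard library.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def partition_into_modules(columns, threshold):
    """ASSUMPTION: stand-in for WGCNA. Features linked by |correlation| >= threshold
    (directly or through a chain of such links) end up in the same module."""
    names = list(columns)
    assigned = set()
    modules = []
    for name in names:
        if name in assigned:
            continue
        module, queue = [name], [name]
        assigned.add(name)
        while queue:
            current = queue.pop()
            for other in names:
                if other not in assigned and abs(pearson(columns[current], columns[other])) >= threshold:
                    assigned.add(other)
                    module.append(other)
                    queue.append(other)
        modules.append(module)
    return modules

def recursive_elimination(features, importance, keep, drop_fraction=0.25):
    """Re-score the surviving features and drop the least important ones until `keep` remain.
    `importance` maps a list of feature names to a {name: score} dict."""
    surviving = list(features)
    while len(surviving) > keep:
        scores = importance(surviving)
        surviving.sort(key=lambda f: scores[f], reverse=True)
        n_drop = max(1, int(len(surviving) * drop_fraction))
        n_drop = min(n_drop, len(surviving) - keep)
        surviving = surviving[:len(surviving) - n_drop]
    scores = importance(surviving)
    return sorted(surviving, key=lambda f: scores[f], reverse=True)

# Toy data: x1 and x2 are nearly identical; x3 and x4 are strongly (negatively) correlated.
y = [1, 2, 3, 4, 5, 6, 7, 8]
columns = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.9, 8.0],
    "x3": [1, -1, 1, -1, 1, -1, 1, -1],
    "x4": [-1, 1, -1, 1, -1, 1, -1, 2],
}

def importance(features):
    # ASSUMPTION: |correlation with y| replaces random forest variable importance.
    return {f: abs(pearson(columns[f], y)) for f in features}

# Stage 1: form modules. Stage 2: screen each module separately.
modules = partition_into_modules(columns, threshold=0.8)
survivors = []
for module in modules:
    survivors.extend(recursive_elimination(module, importance, keep=1))

# Stage 3: one last elimination round over all survivors yields the ranked list.
selected = recursive_elimination(survivors, importance, keep=2)
print(modules)
print(selected)
```

Passing the importance measure in as a function keeps the elimination loop independent of how importance is computed, so a random forest based score could be substituted without touching the surrounding control flow. The `keep` and `drop_fraction` arguments are illustrative knobs chosen for this sketch, not parameters taken from the paper.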

Published in Journal of Statistical Software

ISSN: 1548-7660 (Online)
Publisher: Foundation for Open Access Statistics
Country of publisher: United States
LCC subjects: Social Sciences: Statistics
Website: http://www.jstatsoft.org/
