BMC Medical Research Methodology (Jun 2023)

Use and misuse of random forest variable importance metrics in medicine: demonstrations through incident stroke prediction

  • Meredith L. Wallace,
  • Lucas Mentch,
  • Bradley J. Wheeler,
  • Amanda L. Tapia,
  • Marc Richards,
  • Siyu Zhou,
  • Lixia Yi,
  • Susan Redline,
  • Daniel J. Buysse

DOI
https://doi.org/10.1186/s12874-023-01965-x
Journal volume & issue
Vol. 23, no. 1
pp. 1 – 12

Abstract

Background Machine learning tools such as random forests provide important opportunities for modeling the large, complex data now generated in medicine. Unfortunately, when it comes to understanding why machine learning models are predictive, applied research continues to rely on ‘out of bag’ (OOB) variable importance metrics (VIMPs), which are known within the statistics community to have considerable shortcomings. After explaining the limitations of OOB VIMPs (including bias towards correlated features and limited interpretability), we describe a modern approach called ‘knockoff VIMPs’ and explain its advantages.
Methods We first evaluate current VIMP practices through an in-depth literature review of 50 recent random forest manuscripts. Next, we recommend organized and interpretable strategies for analysis with knockoff VIMPs, including computing them for groups of features and considering multiple model performance metrics. To demonstrate these methods, we develop a random forest to predict 5-year incident stroke in the Sleep Heart Health Study and compare results based on OOB and knockoff VIMPs.
Results Nearly all papers in the literature review contained substantial limitations in their use of VIMPs. In our demonstration, OOB VIMPs for individual variables suggested two highly correlated lung function variables (forced expiratory volume, forced vital capacity) as the best predictors of incident stroke, followed by age and height. Under an organized analytic approach that considered knockoff VIMPs for both groups of features and individual features, the largest contributions to model sensitivity came from medications (especially cardiovascular) and measured medical risk factors, while the largest contributions to model specificity came from age, diastolic blood pressure, self-reported medical risk factors, polysomnography features, and pack-years of smoking. Thus, we reach very different conclusions about stroke risk factors using OOB VIMPs versus knockoff VIMPs.
Conclusions The near-ubiquitous reliance on OOB VIMPs may provide misleading results for researchers who use such methods to guide their research. Given the rapid pace of scientific inquiry using machine learning, it is essential to bring modern knockoff VIMPs that are interpretable and unbiased into widespread applied practice to steer researchers using random forest machine learning toward more meaningful results.
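
As a rough illustration of the reporting strategy described in the Methods (importance computed per feature group and per performance metric), the Python sketch below assumes that class predictions are already available from a random forest fit on the original features and from forests refit with one feature group replaced by knockoff copies. Generating knockoffs and fitting the forests are out of scope here. The function names, group labels, and toy predictions are hypothetical and are not taken from the paper. The importance convention used (metric with original features minus metric with the group knocked off) is an assumption made for this sketch.

```python
from typing import Dict, Sequence


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity and specificity for binary labels (1 = event, 0 = no event).

    Assumes both classes are present in y_true, so neither denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp)}


def group_importance(
    y_true: Sequence[int],
    full_pred: Sequence[int],
    knockoff_preds: Dict[str, Sequence[int]],
) -> Dict[str, Dict[str, float]]:
    """Per-group, per-metric importance.

    Assumed convention: metric of the model using the original features minus
    the metric of the model in which the group was replaced by knockoffs.
    """
    base = sensitivity_specificity(y_true, full_pred)
    importance = {}
    for group, pred in knockoff_preds.items():
        knocked = sensitivity_specificity(y_true, pred)
        importance[group] = {metric: base[metric] - knocked[metric] for metric in base}
    return importance


# Hypothetical toy data: 4 events followed by 4 non-events.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
full_pred = [1, 1, 1, 0, 0, 0, 0, 1]
knockoff_preds = {
    "medications": [1, 0, 0, 0, 0, 0, 0, 1],  # fewer events detected
    "age": [1, 1, 1, 0, 0, 1, 1, 1],          # more false alarms
}

vimp = group_importance(y_true, full_pred, knockoff_preds)
for group, scores in vimp.items():
    print(group, scores)
```

Reporting sensitivity and specificity separately, rather than a single accuracy figure, lets a feature group that mainly helps detect events be distinguished from one that mainly helps rule them out.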

Keywords