IEEE Access (Jan 2022)

Multidimensional Population Health Modeling: A Data-Driven Multivariate Statistical Learning Approach

  • Zhiyuan Wei,
  • Adil Baran Narin,
  • Sayanti Mukherjee

DOI
https://doi.org/10.1109/ACCESS.2022.3153482
Journal volume & issue
Vol. 10
pp. 22737 – 22755

Abstract

Population health is multidimensional in nature and has complex relationships with its various determinants. However, most previous studies investigate a single dimension of population health using linear models, failing to capture both the nonlinearity in the data and the interdependence among multiple dimensions of health outcomes. In this paper, we propose a data-driven multivariate statistical learning approach to simultaneously model various aspects of population health, characterizing the length and quality of life, as a function of health behaviors, clinical care, socioeconomic factors, physical environment, and demographics. We also propose a novel percentile-based variable selection method for multivariate regression that does not compromise the model’s generalization performance. We demonstrate the applicability of the proposed framework using New York State as a case study. Based on cross-validation and statistical hypothesis tests, the results indicate that the multivariate tree boosting method outperforms the traditionally used univariate linear regression model and random forest in modeling multidimensional population health. A variable importance heat map illustrates the relative influence of the key health determinants on the various dimensions of population health. Partial dependence plots are used to quantify the marginal effects and the nonlinear relationships between the health outcomes and health inputs. Our results show that teen birth rate is strongly associated with both length of life (e.g., child mortality) and quality of life (e.g., physically unhealthy days). Socioeconomic status is the key indicator for predicting child and infant mortality. The proposed framework can be used as a decision support tool for accurately assessing and predicting multivariate population health.

Keywords