Informatics in Medicine Unlocked (Jan 2022)
Risk prediction for repeated measures health outcomes: A divide and recombine framework
Abstract
We propose a machine learning framework for risk prediction of a binary response sequence observed over time, which forms a trajectory of disease progression and regression. The framework employs a divide and recombine technique based on the relationship between marginal, conditional, and joint probability models. To demonstrate the framework, we use data from the US Health and Retirement Study with seven follow-ups for the response, the activity of daily living index (ADL), and for the risk factors. To assess the effects of the risk factors on ADL, the framework adapts regressive logistic regression, logistic regression with the lasso, support vector machines, classification trees, random forests, and neural networks. The models are tuned and evaluated on training and test data containing 75% and 25% of the cases, respectively. Test accuracies ranged from 92% to 95% across follow-ups, with high specificity and sensitivity. The accuracy, sensitivity, and specificity of the ensemble of the six models are all above 90%. Including interaction terms between risk factors, between risk factors and historical ADL, and between historical ADL from different follow-ups in the regressive logistic model noticeably improves accuracy, sensitivity, and specificity. Adjusting the probability threshold for classification considerably increases sensitivity. The framework provides a general and flexible approach to risk prediction for health-related responses that are measured repeatedly over time. The method can also be applied in other settings to analyze big data.
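As an illustration of the recombination step, the joint probability of a binary trajectory can be written as the marginal probability of the first response multiplied by the conditional probabilities of each later response given its history. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy conditional model, and all numeric values are hypothetical, and in practice each conditional probability would come from a fitted model (for example, a regressive logistic regression) that also uses the risk factors.

```python
from itertools import product
from typing import Callable, Sequence


def joint_probability(
    trajectory: Sequence[int],
    marginal_p1: float,
    conditional_p1: Callable[[Sequence[int]], float],
) -> float:
    """Chain rule: P(y1, ..., yk) = P(y1) * prod_j P(yj | y1, ..., y_{j-1}).

    marginal_p1 is P(Y1 = 1); conditional_p1(history) returns
    P(Yj = 1 | history). Both are hypothetical stand-ins for fitted models.
    """
    if not trajectory:
        raise ValueError("trajectory must contain at least one response")
    # Marginal model for the first follow-up.
    p = marginal_p1 if trajectory[0] == 1 else 1.0 - marginal_p1
    # Conditional models for each later follow-up, given the observed history.
    for j in range(1, len(trajectory)):
        p1 = conditional_p1(trajectory[:j])
        p *= p1 if trajectory[j] == 1 else 1.0 - p1
    return p


def total_probability(
    k: int,
    marginal_p1: float,
    conditional_p1: Callable[[Sequence[int]], float],
) -> float:
    """Sanity check: the joint probabilities of all 2**k trajectories sum to 1."""
    return sum(
        joint_probability(t, marginal_p1, conditional_p1)
        for t in product((0, 1), repeat=k)
    )


def classify(prob: float, threshold: float = 0.5) -> int:
    """Predict 1 when the probability reaches the threshold.

    Lowering the threshold labels more cases as positive, which tends to
    raise sensitivity at the cost of specificity.
    """
    return 1 if prob >= threshold else 0


def toy_conditional(history: Sequence[int]) -> float:
    """Hypothetical conditional model that depends only on the last response."""
    return 0.7 if history[-1] == 1 else 0.1
```

With the toy values above, `joint_probability([1, 1, 0], 0.2, toy_conditional)` multiplies 0.2, 0.7, and 0.3, and `classify(0.3, threshold=0.25)` returns 1 where the default threshold of 0.5 would return 0.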