Informatics in Medicine Unlocked (Jan 2022)

Comparing conventional statistical models and machine learning in a small cohort of South African cardiac patients

  • Preesha Premsagar,
  • Colleen Aldous,
  • Tonya M. Esterhuizen,
  • Byron J. Gomes,
  • Jason William Gaskell,
  • David L. Tabb

Journal volume & issue
Vol. 34
p. 101103

Abstract

Background: Machine learning is used to process large volumes of data with complex non-linear relationships between predictive variables and predictions. Research into the usefulness of machine learning with small datasets remains limited.

Aim: To compare conventional statistical methods and machine learning for predicting angiogram outcomes in a small cohort of South African cardiac patients.

Methods: This is a retrospective study of patients with cardiac risk factors at Inkosi Albert Luthuli Central Hospital, Durban, South Africa, from 2002 to 2008. Models were designed using predictive risk factors to forecast a binary angiogram outcome (normal or abnormal) by applying conventional statistical models (binary logistic and log binomial) and stacking ensemble machine learning.

Results: The prevalence of abnormal angiograms was 99/173 (57%). Predictive data were used to model this outcome. The binary logistic regression model, which estimates odds ratios, was unsuitable. The log binomial model, which estimates relative risk, did not converge after various stepwise modelling attempts. Thereafter, machine learning models were used. These included logistic regression, k-nearest neighbour, decision tree, support vector machine, and naïve Bayes. The ensemble model combined all five algorithms and achieved accuracy >70%, with excellent performance at different thresholds (area under the curve, AUC, >80%).

Discussion: The logistic regression model was unsuitable because the outcome prevalence was >10%, so an odds ratio would have been unreliable and would have overestimated the true effect. A log binomial model with relative risk estimates did not converge, possibly owing to the multiple predictive variables. Overall, conventional statistical models were unsuccessful in this instance. Machine learning models were limited by the small dataset. However, combined modelling with the stacking ensemble method produced good results in the small, homogeneous database by exploiting the strengths of each contributing algorithm.

Conclusions: Researchers may apply machine learning when conventional statistical models are inconclusive in small, homogeneous databases with multiple variables and a complex relationship to the outcome. Machine learning is a viable option even with relatively small cohorts if the number of predictive variables is also small.
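
The point about odds ratios overstating the effect when the outcome is common can be illustrated with a short calculation. The sketch below is not taken from the study: the function names and the risk values are hypothetical, chosen only so that both scenarios share the same relative risk while differing in how common the outcome is.

```python
# Minimal sketch: how far the odds ratio (OR) drifts from the relative
# risk (RR) as the outcome becomes more common.
# All risk values below are hypothetical and are NOT data from the study.

def odds(p):
    """Convert a probability (risk) into odds."""
    return p / (1.0 - p)

def relative_risk(risk_exposed, risk_unexposed):
    """Ratio of the outcome risk in the exposed vs. unexposed group."""
    return risk_exposed / risk_unexposed

def odds_ratio_from_risks(risk_exposed, risk_unexposed):
    """Ratio of the outcome odds in the exposed vs. unexposed group."""
    return odds(risk_exposed) / odds(risk_unexposed)

# (risk in exposed group, risk in unexposed group); both pairs give RR = 2
scenarios = {
    "rare outcome": (0.02, 0.01),
    "common outcome": (0.70, 0.35),
}

for label, (p1, p0) in scenarios.items():
    rr = relative_risk(p1, p0)
    or_ = odds_ratio_from_risks(p1, p0)
    print(f"{label}: RR = {rr:.2f}, OR = {or_:.2f}")
```

With the rare outcome the two measures nearly coincide (OR of about 2.02 against an RR of 2), whereas with the common outcome the odds ratio is roughly 4.33 for the same relative risk of 2. This is the general arithmetic behind preferring a relative risk estimate when the outcome prevalence is high, as it was here at 57%.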

Keywords