Journal of Hepatocellular Carcinoma (Feb 2022)

Hepatocellular Carcinoma Risk Prediction in the NIH-AARP Diet and Health Study Cohort: A Machine Learning Approach

  • Thomas J,
  • Liao LM,
  • Sinha R,
  • Patel T,
  • Antwi SO

Journal volume & issue
Vol. 9
pp. 69 – 81

Abstract

Jonathan Thomas,1 Linda M Liao,2 Rashmi Sinha,2 Tushar Patel,1,* Samuel O Antwi3,*

1Department of Transplantation, Mayo Clinic, Jacksonville, FL, USA; 2Division of Cancer Epidemiology and Genetics, The National Cancer Institute, Bethesda, MD, USA; 3Department of Quantitative Health Sciences, Mayo Clinic, Jacksonville, FL, USA

*These authors contributed equally to this work

Correspondence: Samuel O Antwi, Department of Quantitative Health Sciences, Mayo Clinic, 4500 San Pablo Road South, Vincent Stabile Building 756N, Jacksonville, FL, 32224, USA, Tel +1-904-953-0310, Fax +1-904-953-1447, Email [email protected]

Purpose: Prediction of hepatocellular carcinoma (HCC) development in persons with known risk factors remains a challenge and is an urgent unmet need, considering projected increases in HCC incidence and mortality in the US. We aimed to use machine learning techniques to identify a set of demographic, lifestyle, and health history information that can be used simultaneously for population-level HCC risk prediction.

Methods: Data from 377,065 participants of the NIH-AARP Diet and Health Study, among whom 647 developed HCC over 16 years of follow-up, were analyzed. The sample was randomly divided into independent training (60%) and validation (40%) sets. We evaluated 123 participant characteristics and tested 15 different machine learning algorithms for robustness in predicting HCC risk. Separately, we evaluated variables selected from multivariable logistic regression for risk prediction.

Results: The random under-sampling boosting (RUSBoost) algorithm performed best during model testing. Fourteen participant characteristics were selected for risk prediction based on differences between cases and controls (Bonferroni-corrected p-values < 0.0004) and on the variables used most frequently in the first two decision trees of the RUSBoost learner. A predictive model based on the 14 variables had an AUC of 0.72 (sensitivity=0.68, specificity=0.63) and an independent validation AUC of 0.65 (sensitivity=0.68, specificity=0.63). A subset of 9 variables identified through logistic regression also had an AUC of 0.72 (sensitivity=0.67, specificity=0.63) and an independent validation AUC of 0.65 (sensitivity=0.70, specificity=0.61).

Conclusion: Population-level HCC risk prediction can be performed with a machine learning-based algorithm and could inform strategies for improving HCC risk reduction in at-risk groups.

Keywords: HCC, hepatocellular carcinoma, liver cancer, machine learning, risk prediction
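The evaluation design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code and does not reproduce their results. It only shows a 60/40 random split, random under-sampling of controls (the resampling step that RUSBoost pairs with boosting, which is omitted here), and the sensitivity, specificity, and AUC calculations. The function names, the (features, label) record layout, the seeds, and the 0.05 family-wise alpha passed to the Bonferroni helper are assumptions for illustration; the abstract reports only the 0.0004 cutoff and the 123 characteristics evaluated.

```python
import random


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def split_train_validation(records, train_fraction=0.6, seed=0):
    """Shuffle a copy of the records and cut it into two disjoint sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def random_undersample(records, seed=0):
    """Keep every case (label 1) plus an equal-sized random draw of controls.

    Each record is assumed to be a (features, label) tuple.
    """
    cases = [r for r in records if r[1] == 1]
    controls = [r for r in records if r[1] == 0]
    kept = random.Random(seed).sample(controls, min(len(cases), len(controls)))
    return cases + kept


def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for binary labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """AUC as the chance a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With the assumed alpha of 0.05, `bonferroni_threshold(0.05, 123)` gives roughly 0.0004, which is consistent with the cutoff quoted in the Results. Under-sampling is relevant here because cases are rare: 647 of 377,065 participants is under 0.2% of the cohort.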

Keywords