Novel machine learning models outperform risk scores in predicting hepatocellular carcinoma in patients with chronic viral hepatitis
Grace Lai-Hung Wong,
Vicki Wing-Ki Hui,
Qingxiong Tan,
Jingwen Xu,
Hye Won Lee,
Terry Cheuk-Fung Yip,
Baoyao Yang,
Yee-Kit Tse,
Chong Yin,
Fei Lyu,
Jimmy Che-To Lai,
Grace Chung-Yan Lui,
Henry Lik-Yuen Chan,
Pong-Chi Yuen,
Vincent Wai-Sun Wong
Affiliations
Grace Lai-Hung Wong
Medical Data Analytic Centre, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Institute of Digestive Disease, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China
Vicki Wing-Ki Hui
Medical Data Analytic Centre, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China
Qingxiong Tan
Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region, China
Jingwen Xu
Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region, China
Hye Won Lee
Department of Internal Medicine, Yonsei University College of Medicine, Seoul, South Korea
Terry Cheuk-Fung Yip
Medical Data Analytic Centre, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Institute of Digestive Disease, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China
Baoyao Yang
Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region, China
Yee-Kit Tse
Medical Data Analytic Centre, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China
Chong Yin
Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region, China
Fei Lyu
Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region, China
Jimmy Che-To Lai
Medical Data Analytic Centre, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Institute of Digestive Disease, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China
Grace Chung-Yan Lui
Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China
Henry Lik-Yuen Chan
Medical Data Analytic Centre, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Union Hospital, Hong Kong Special Administrative Region, China
Pong-Chi Yuen
Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region, China; Department of Computer Science, Room 634, David C Lam Building (DLB634), Hong Kong Baptist University, Kowloon Tong, Hong Kong Special Administrative Region, China. Tel. +852 3411 7091; Fax: +852 3411 7892.
Vincent Wai-Sun Wong
Medical Data Analytic Centre, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Institute of Digestive Disease, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Corresponding authors. Addresses: Department of Medicine and Therapeutics, 9/F Prince of Wales Hospital, 30-32 Ngan Shing Street, Shatin, Hong Kong Special Administrative Region, China. Tel.: +852 3505 1205; Fax: +852 2637 3852.
Background & Aims: Accurate hepatocellular carcinoma (HCC) risk prediction facilitates appropriate surveillance strategies and reduces cancer mortality. We aimed to derive and validate novel machine learning models to predict HCC in a territory-wide cohort of patients with chronic viral hepatitis (CVH) using data from the Hospital Authority Data Collaboration Lab (HADCL).
Methods: This was a territory-wide, retrospective, observational cohort study of patients with CVH in Hong Kong in 2000–2018, identified from HADCL based on viral markers, diagnosis codes, and antiviral treatment for chronic hepatitis B and/or C. The cohort was randomly split into training and validation cohorts in a 7:3 ratio. Five popular machine learning methods, namely logistic regression, ridge regression, AdaBoost, decision tree, and random forest, were applied and compared to find the best prediction model.
Results: A total of 124,006 patients with CVH and complete data were included to build the models. In the training cohort (n = 86,804; 6,821 HCC), ridge regression (area under the receiver operating characteristic curve [AUROC] 0.842), decision tree (0.952), and random forest (0.992) performed the best. In the validation cohort (n = 37,202; 2,875 HCC), ridge regression (AUROC 0.844) and random forest (0.837) maintained their accuracy, which was significantly higher than that of existing HCC risk scores: CU-HCC (0.672), GAG-HCC (0.745), REACH-B (0.671), PAGE-B (0.748), and REAL-B (0.712). The low cut-off (0.07) of the HCC ridge score (HCC-RS) achieved 90.0% sensitivity and 98.6% negative predictive value (NPV) in the validation cohort. The high cut-off (0.15) of HCC-RS achieved high specificity (90.0%) and NPV (95.6%); 31.1% of patients remained indeterminate.
Conclusions: HCC-RS from the ridge regression machine learning model accurately predicted HCC in patients with CVH. These machine learning models may be developed as built-in functional keys or calculators in electronic health systems to reduce cancer mortality.
Lay summary: Novel machine learning models generated accurate risk scores for hepatocellular carcinoma (HCC) in patients with chronic viral hepatitis. The HCC ridge score was consistently more accurate than existing HCC risk scores. These models may be incorporated into electronic health systems to develop appropriate cancer surveillance strategies and reduce cancer deaths.
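The dual cut-off use of HCC-RS reported in the Results can be illustrated with a minimal Python sketch. Only the two thresholds (0.07 and 0.15) come from the abstract. The function names, the toy scores and labels, and the choice to treat a score equal to a cut-off as falling on the higher-risk side are assumptions made for illustration, and the sketch does not reproduce the authors' model or data.

```python
# Illustrative sketch only: thresholds are from the abstract; everything
# else (names, toy data, >= convention) is a hypothetical assumption.

LOW_CUTOFF = 0.07   # rule-out threshold reported for HCC-RS
HIGH_CUTOFF = 0.15  # rule-in threshold reported for HCC-RS


def triage(score, low=LOW_CUTOFF, high=HIGH_CUTOFF):
    """Map a predicted risk score to a zone: low, indeterminate, or high."""
    if score < low:
        return "low"
    if score >= high:
        return "high"
    return "indeterminate"


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def metrics_at(scores, labels, cutoff):
    """Sensitivity, specificity, and NPV when score >= cutoff counts as positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= cutoff
        if predicted_positive and label:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "npv": _ratio(tn, tn + fn),
    }


def indeterminate_fraction(scores, low=LOW_CUTOFF, high=HIGH_CUTOFF):
    """Share of patients whose score lies between the two cut-offs."""
    zones = [triage(s, low, high) for s in scores]
    return _ratio(zones.count("indeterminate"), len(zones))


# Hypothetical toy data: predicted scores and observed HCC status (1 = HCC).
scores = [0.02, 0.05, 0.06, 0.09, 0.12, 0.16, 0.20, 0.30, 0.04, 0.11]
labels = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]

if __name__ == "__main__":
    print("low cut-off:", metrics_at(scores, labels, LOW_CUTOFF))
    print("high cut-off:", metrics_at(scores, labels, HIGH_CUTOFF))
    print("indeterminate:", indeterminate_fraction(scores))
```

Using two thresholds trades coverage for confidence: the low cut-off is tuned for sensitivity and NPV (ruling out), the high cut-off for specificity (ruling in), and patients between the two are left unclassified, which corresponds to the 31.1% indeterminate group in the validation cohort.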