Machine learning prediction of incidence of Alzheimer’s disease using large-scale administrative health data

Ji Hwan Park; Han Eol Cho; Jong Hun Kim; Melanie M. Wall; Yaakov Stern; Hyunsun Lim; Shinjae Yoo; Hyoung Seop Kim; Jiook Cha

doi:10.1038/s41746-020-0256-0

npj Digital Medicine (Mar 2020)

Affiliations

Ji Hwan Park: Computational Science Initiative, Brookhaven National Laboratory
Han Eol Cho: Department of Rehabilitation Medicine, Gangnam Severance Hospital and Rehabilitation Institute of Neuromuscular Disease, Yonsei University College of Medicine
Jong Hun Kim: Department of Neurology, Dementia Center, National Health Insurance Service Ilsan Hospital
Melanie M. Wall: Department of Psychiatry, Vagelos College of Physicians and Surgeons, Columbia University
Yaakov Stern: Department of Psychiatry, Vagelos College of Physicians and Surgeons, Columbia University
Hyunsun Lim: Research and Analysis Team, National Health Insurance Service Ilsan Hospital
Shinjae Yoo: Computational Science Initiative, Brookhaven National Laboratory
Hyoung Seop Kim: Department of Physical Medicine and Rehabilitation, Dementia Center, National Health Insurance Service Ilsan Hospital
Jiook Cha: Department of Psychiatry, Vagelos College of Physicians and Surgeons, Columbia University

DOI: https://doi.org/10.1038/s41746-020-0256-0
Journal volume & issue: Vol. 3, no. 1
pp. 1 – 7

Abstract

Nationwide population-based cohorts provide a new opportunity to build automated risk prediction models based on individuals’ history of health and healthcare, going beyond existing risk prediction models. We tested whether machine learning models can predict the future incidence of Alzheimer’s disease (AD) using large-scale administrative health data. From the Korean National Health Insurance Service database between 2002 and 2010, we obtained de-identified health data on older adults above 65 years of age (N = 40,736) containing 4,894 unique clinical features, including ICD-10 codes, medication codes, laboratory values, history of personal and family illness, and socio-demographics. To define incident AD, we considered two operational definitions: “definite AD”, with diagnostic codes and dementia medication (n = 614), and “probable AD”, with diagnosis only (n = 2,026). We trained and validated random forest, support vector machine, and logistic regression models to predict incident AD in the 1, 2, 3, and 4 subsequent years. For predicting the future incidence of AD in balanced samples (bootstrapping), the machine learning models showed reasonable performance in 1-year prediction, with AUCs of 0.775 and 0.759 based on the “definite AD” and “probable AD” outcomes, respectively; in 2-year prediction, 0.730 and 0.693; in 3-year prediction, 0.677 and 0.644; in 4-year prediction, 0.725 and 0.683. The results were similar when the entire (unbalanced) samples were used. Important clinical features selected in logistic regression included hemoglobin level, age, and urine protein level. This study may shed light on the utility of data-driven machine learning models based on large-scale administrative health data for AD risk prediction, which may enable better selection of individuals at risk for AD in clinical trials or earlier detection in clinical settings.
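
The abstract reports AUC on balanced (bootstrapped) samples without describing the resampling scheme. As a rough illustration only, the sketch below shows one way such an evaluation could look in plain Python. The function names `balanced_bootstrap` and `auc`, the choice to resample each class with replacement to the minority-class size, and the synthetic scores are all assumptions made for this example, not the authors' method or data.

```python
import random


def balanced_bootstrap(labels, rng):
    """Return indices of a class-balanced bootstrap sample.

    Assumption: both classes are drawn with replacement, each to the
    size of the smaller class. The abstract does not state the scheme.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    return rng.choices(pos, k=n) + rng.choices(neg, k=n)


def auc(scores, labels):
    """AUC as the probability that a random positive case scores higher
    than a random negative case; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic toy data (hypothetical): 20 cases, 380 controls, with
# model scores that are only weakly higher for cases.
rng = random.Random(0)
labels = [1] * 20 + [0] * 380
scores = [rng.gauss(1.0 if y == 1 else 0.0, 1.0) for y in labels]

idx = balanced_bootstrap(labels, rng)
sample_labels = [labels[i] for i in idx]
sample_scores = [scores[i] for i in idx]

print("sample size:", len(idx), "cases:", sum(sample_labels))
print("AUC on balanced sample:", round(auc(sample_scores, sample_labels), 3))
```

Because AUC compares rankings between classes, it is insensitive to class prevalence in expectation, which is consistent with the abstract's remark that results were similar on the unbalanced samples.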

Published in npj Digital Medicine

ISSN: 2398-6352 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://www.nature.com/npjdigitalmed/
