A machine learning-based data mining in medical examination data: a biological features-based biological age prediction model

Qing Yang; Sunan Gao; Junfen Lin; Ke Lyu; Zexu Wu; Yuhao Chen; Yinwei Qiu; Yanrong Zhao; Wei Wang; Tianxiang Lin; Huiyun Pan; Ming Chen

doi:10.1186/s12859-022-04966-7

BMC Bioinformatics (Oct 2022)

Affiliations

Qing Yang: Zhejiang Provincial Center for Disease Control and Prevention
Sunan Gao: College of Biosystems Engineering and Food Science, Zhejiang University
Junfen Lin: Zhejiang Provincial Center for Disease Control and Prevention
Ke Lyu: College of Life Sciences, Zhejiang University
Zexu Wu: College of Life Sciences, Zhejiang University
Yuhao Chen: College of Life Sciences, Zhejiang University
Yinwei Qiu: Zhejiang Provincial Center for Disease Control and Prevention
Yanrong Zhao: Zhejiang Provincial Center for Disease Control and Prevention
Wei Wang: Zhejiang Provincial Center for Disease Control and Prevention
Tianxiang Lin: Zhejiang Provincial Center for Disease Control and Prevention
Huiyun Pan: The First Affiliated Hospital of School of Medicine, Zhejiang University
Ming Chen: College of Life Sciences, Zhejiang University

DOI: https://doi.org/10.1186/s12859-022-04966-7
Journal volume & issue: Vol. 23, no. 1
pp. 1 – 23

Abstract

Background: Biological age (BA) has been recognized as a more accurate indicator of aging than chronological age (CA). However, current work has three limitations: insufficient attention to the incompleteness of the medical data used to construct BA; a lack of machine learning-based BA (ML-BA) models for the Chinese population; and neglect of how the degree of model overfitting affects the stability of association results.

Methods and results: Based on medical examination data from the Chinese population (45–90 years), we first evaluated the most suitable interpolation method for missing data, then constructed 14 ML-BAs based on biomarkers, and finally explored the associations between ML-BAs and health statuses (healthy risk indicators and disease). We found that round-robin linear regression interpolation performed best, while AutoEncoder showed the highest interpolation stability. We further illustrated the potential overfitting problem in ML-BAs, which affected the stability of the ML-BAs' associations with health statuses. We then proposed a composite ML-BA based on the Stacking method with a simple meta-model (STK-BA), which overcame the overfitting problem and associated more strongly with CA (r = 0.66, P < 0.001), healthy risk indicators, disease counts, and six types of disease.

Conclusion: We provided an improved aging measurement method for middle-aged and elderly groups in China, which can more stably capture aging characteristics other than CA, supporting the emerging application potential of machine learning in aging research.
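The round-robin linear regression interpolation named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract does not describe; it only shows the general idea of cycling through the columns and regressing each incomplete column on the others. The function names (`solve`, `fit_linear`, `round_robin_impute`), the mean-based starting fill, and the fixed number of rounds are all assumptions made for illustration.

```python
from statistics import fmean


def solve(a, b):
    """Solve the square linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(xs, ys):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


def predictors(row, target):
    """All values of a row except the target column."""
    return [v for c, v in enumerate(row) if c != target]


def round_robin_impute(data, n_rounds=10):
    """Fill None entries by regressing each incomplete column on the others in turn."""
    n_cols = len(data[0])
    missing = [[v is None for v in row] for row in data]
    # Assumed starting point: column means, so every regression has full predictors.
    means = [fmean(row[j] for row in data if row[j] is not None) for j in range(n_cols)]
    filled = [[means[j] if v is None else v for j, v in enumerate(row)] for row in data]
    for _ in range(n_rounds):
        for j in range(n_cols):
            if not any(flags[j] for flags in missing):
                continue  # nothing to impute in this column
            # Train only on rows where column j was actually observed.
            train = [(predictors(r, j), r[j]) for r, f in zip(filled, missing) if not f[j]]
            coef = fit_linear([x for x, _ in train], [y for _, y in train])
            for r, f in zip(filled, missing):
                if f[j]:
                    r[j] = coef[0] + sum(c * x for c, x in zip(coef[1:], predictors(r, j)))
    return filled


# Toy data (invented for this sketch): the second column is exactly 2 * first + 1.
rows = [[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [4.0, None], [5.0, 11.0]]
completed = round_robin_impute(rows)
print(round(completed[3][1], 6))
```

On this toy data the missing entry is recovered as approximately 9, because the observed rows follow the linear relation exactly. Real examination data would need safeguards this sketch omits, such as handling collinear predictors (which make the normal equations singular) and a convergence check in place of a fixed number of rounds.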

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/