Predicting Phenotypes From High-Dimensional Genomes Using Gradient Boosting Decision Trees

Tingxi Yu; Li Wang; Wuping Zhang; Guofang Xing; Jiwan Han; Fuzhong Li; Chunqing Cao

doi:10.1109/ACCESS.2022.3171341

IEEE Access (Jan 2022)

Affiliations

Tingxi Yu: College of Software, Shanxi Agricultural University, Taigu, Shanxi, China
Li Wang: College of Software, Shanxi Agricultural University, Taigu, Shanxi, China
Wuping Zhang: College of Software, Shanxi Agricultural University, Taigu, Shanxi, China
Guofang Xing: College of Agriculture, Shanxi Agricultural University, Taigu, Shanxi, China
Jiwan Han: College of Software, Shanxi Agricultural University, Taigu, Shanxi, China
Fuzhong Li: College of Software, Shanxi Agricultural University, Taigu, Shanxi, China
Chunqing Cao: College of Software, Shanxi Agricultural University, Taigu, Shanxi, China

DOI: https://doi.org/10.1109/ACCESS.2022.3171341
Journal volume & issue: Vol. 10
pp. 48126 – 48140

Abstract

Genomic selection (GS) is an emerging technique for predicting unknown phenotypes using genome-wide marker coverage, allowing the use of efficient computational models to select individuals with high phenotypic values as candidate breeding populations. However, GS remains challenging in efficient crop breeding because of the limited size of training populations, the nature of genotype-environment interactions, and the complex interaction patterns between molecular markers. In this study, we use ensemble learning algorithms to construct gradient boosted decision tree (GBDT) models that predict phenotypic values from genotypic markers. We trained GBDT on the wheat GS dataset and compared its predictive performance with that of six other widely used GS models. The mean normalized discounted cumulative gain (MNDCG) was used to evaluate the ability of each model to select individuals with high phenotypic values. The results show that: (1) Bayesian models converge and reach a steady state only when a sufficient number of iterations is set. As the number of iterations increases, the prediction accuracy of the Bayesian models increases, but their computational efficiency decreases significantly. At 200,000 iterations, the five Bayesian models perform similarly and converge to a stable state; their prediction accuracy is 7.60% higher than that of the GBDT model overall, while the GBDT model is 70 times as computationally efficient as the Bayesian models. (2) The RRBLUP model had the best overall prediction performance, but for some traits the GBDT model was still better than the RRBLUP and Bayesian models at selecting individuals with high phenotypic values. (3) The prediction accuracy of the GBDT and RRBLUP models was influenced by the marker subset: the more markers used, the higher the prediction accuracy, so selecting a genetic marker set of appropriate size could improve the prediction performance of the models.
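
The abstract names MNDCG as the ranking metric but does not spell out its formula. As a minimal sketch, assuming one common formulation (NDCG at cutoff k with the true phenotypic value as gain and a 1/log2(rank + 1) discount, averaged over k = 1..K), it could be computed as follows. The function names, the cutoff parameter `max_k`, and the toy data are hypothetical and not taken from the paper; the sketch also assumes non-negative phenotypic values.

```python
import math


def dcg_at_k(values_in_ranked_order, k):
    """Discounted cumulative gain of the first k items.

    Assumption: gain is the true phenotypic value and the discount
    for rank r (1-based) is 1 / log2(r + 1).
    """
    return sum(
        value / math.log2(position + 2)
        for position, value in enumerate(values_in_ranked_order[:k])
    )


def ndcg_at_k(y_true, y_pred, k):
    """DCG of the predicted ranking divided by DCG of the ideal ranking."""
    # Rank individuals by predicted value, highest first.
    order = sorted(range(len(y_true)), key=lambda i: y_pred[i], reverse=True)
    ranked_true = [y_true[i] for i in order]
    # The ideal ranking sorts by the true phenotypic values themselves.
    ideal_dcg = dcg_at_k(sorted(y_true, reverse=True), k)
    return dcg_at_k(ranked_true, k) / ideal_dcg if ideal_dcg else 0.0


def mean_ndcg(y_true, y_pred, max_k):
    """Average NDCG@k over k = 1..max_k (hypothetical cutoff parameter)."""
    return sum(ndcg_at_k(y_true, y_pred, k) for k in range(1, max_k + 1)) / max_k


if __name__ == "__main__":
    # Toy data for illustration only, not values from the study.
    true_values = [3.0, 1.0, 2.0, 0.5]
    predicted = [2.5, 1.2, 0.9, 0.4]
    print(mean_ndcg(true_values, predicted, max_k=3))
```

A score of 1 means the model ranks the top individuals exactly as the true phenotypic values would, which is why a metric of this kind suits the selection task better than a correlation over the whole population.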

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
