Machine learning optimized polygenic scores for blood cell traits identify sex-specific trajectories and genetic correlations with disease
Yu Xu,
Dragana Vuckovic,
Scott C. Ritchie,
Parsa Akbari,
Tao Jiang,
Jason Grealey,
Adam S. Butterworth,
Willem H. Ouwehand,
David J. Roberts,
Emanuele Di Angelantonio,
John Danesh,
Nicole Soranzo,
Michael Inouye
Affiliations
Yu Xu
Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK; Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC 3004, Australia; British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK; Corresponding author
Dragana Vuckovic
Department of Human Genetics, Wellcome Sanger Institute, Hinxton CB10 1SA, UK; National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
Scott C. Ritchie
Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK; Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC 3004, Australia; British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK; British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
Parsa Akbari
British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK; National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
Tao Jiang
British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
Jason Grealey
Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC 3004, Australia; Department of Mathematics and Statistics, La Trobe University, Bundoora, VIC 3086, Australia
Adam S. Butterworth
British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK; National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK; British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK; Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge CB10 1SA, UK
Willem H. Ouwehand
Department of Human Genetics, Wellcome Sanger Institute, Hinxton CB10 1SA, UK; British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK; National Health Service (NHS) Blood and Transplant, Cambridge Biomedical Campus, Cambridge CB2 0PT, UK; Department of Haematology, University of Cambridge, Cambridge CB2 0PT, UK
David J. Roberts
National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK; National Health Service (NHS) Blood and Transplant, Cambridge Biomedical Campus, Cambridge CB2 0PT, UK; National Institute for Health Research Oxford Biomedical Research Centre, University of Oxford and John Radcliffe Hospital, Oxford OX3 9DU, UK
Emanuele Di Angelantonio
British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK; National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK; British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK; Health Data Science Research Centre, Human Technopole, Milan 20157, Italy; Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge CB10 1SA, UK
John Danesh
British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK; Department of Human Genetics, Wellcome Sanger Institute, Hinxton CB10 1SA, UK; National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK; British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK; Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge CB10 1SA, UK
Nicole Soranzo
Department of Human Genetics, Wellcome Sanger Institute, Hinxton CB10 1SA, UK; National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK; British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
Michael Inouye
Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK; Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC 3004, Australia; British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK; British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK; Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge CB10 1SA, UK; The Alan Turing Institute, London NW1 2DB, UK; Corresponding author
Summary: Genetic association studies for blood cell traits, which are key indicators of health and immune function, have identified several hundred associations and defined a complex polygenic architecture. Polygenic scores (PGSs) for blood cell traits have potential clinical utility in disease risk prediction and prevention, but designing PGS remains challenging and the optimal methods are unclear. To address this, we evaluated the relative performance of 6 methods to develop PGS for 26 blood cell traits, including a standard method of pruning and thresholding (P + T) and 5 learning methods: LDpred2, elastic net (EN), Bayesian ridge (BR), multilayer perceptron (MLP) and convolutional neural network (CNN). We evaluated these optimized PGSs on blood cell trait data from UK Biobank and INTERVAL. We find that PGSs designed using common machine learning methods EN and BR show improved prediction of blood cell traits and consistently outperform other methods. Our analyses suggest EN/BR as the top choices for PGS construction, showing improved performance for 25 blood cell traits in the external validation, with correlations with the directly measured traits increasing by 10%–23%. Ten PGSs showed significant statistical interaction with sex, and sex-specific PGS stratification showed that all of them had substantial variation in the trajectories of blood cell traits with age. Genetic correlations between the PGSs for blood cell traits and common human diseases identified well-known as well as new associations. We develop machine learning-optimized PGS for blood cell traits, demonstrate their relationships with sex, age, and disease, and make these publicly available as a resource.