Overestimated prediction using polygenic prediction derived from summary statistics

David Keetae Park; Mingshen Chen; Seungsoo Kim; Yoonjung Yoonie Joo; Rebekah K. Loving; Hyoung Seop Kim; Jiook Cha; Shinjae Yoo; Jong Hun Kim

doi:10.1186/s12863-023-01151-4

BMC Genomic Data (Sep 2023)

Affiliations

David Keetae Park: Department of Biomedical Engineering, Columbia University
Mingshen Chen: Department of Applied Mathematics & Statistics, Stony Brook University
Seungsoo Kim: Department of Obstetrics and Gynecology, Columbia University Irving Medical Center
Yoonjung Yoonie Joo: Samsung Advanced Institute for Health Sciences & Technology (SAIHST), Sungkyunkwan University, Samsung Medical Center
Rebekah K. Loving: Department of Biology, California Institute of Technology
Hyoung Seop Kim: Department of Physical Medicine and Rehabilitation, Dementia Center, National Health Insurance Service Ilsan Hospital
Jiook Cha: Department of Psychology, Brain and Cognitive Sciences, AI Institute, Seoul National University
Shinjae Yoo: Computational Science Initiative, Brookhaven National Lab. Computer Science and Math
Jong Hun Kim: Department of Neurology, Dementia Center, National Health Insurance Service Ilsan Hospital

DOI: https://doi.org/10.1186/s12863-023-01151-4
Journal volume & issue: Vol. 24, no. 1, pp. 1–10

Abstract

Background: When a polygenic risk score (PRS) is derived from summary statistics, independence between the discovery and test sets cannot be verified. We compared two types of PRS: one derived from raw genetic data (denoted rPRS) and one derived from the summary statistics of the International Genomics of Alzheimer’s Project (IGAP) (denoted sPRS).

Results: Two traits with high heritability in UK Biobank, hypertension and height, are used to derive a reference scale for the effect of PRS. sPRS without APOE, derived from IGAP, yields ΔAUC and ΔR² of 0.051 ± 0.013 and 0.063 ± 0.015 for the Alzheimer’s Disease Sequencing Project (ADSP) and 0.060 and 0.086 for the Accelerating Medicine Partnership - Alzheimer’s Disease (AMP-AD). On UK Biobank, with discovery and test sets of similar size, rPRS for hypertension yields 0.0036 ± 0.0027 (ΔAUC) and 0.0032 ± 0.0028 (ΔR²). For height, ΔR² is 0.029 ± 0.0037.

Conclusion: Given the high heritability of hypertension and height and the sample size of UK Biobank, the sPRS results from the Alzheimer’s disease (AD) databases are inflated. Independence between discovery and test sets is a well-known basic requirement for PRS studies. However, many PRS studies cannot meet this requirement, because the two sets cannot be compared directly when only summary statistics are available. Thus, for sPRS, potential sample duplication should be carefully considered within the same ethnic group.

Published in BMC Genomic Data

ISSN: 2730-6844 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Science: Biology (General): Genetics
Website: https://bmcgenomdata.biomedcentral.com/
