Imputation and quality control steps for combining multiple genome-wide datasets

Shefali S Verma; Mariza de Andrade; Gerard Tromp; Helena Kuivaniemi; Elizabeth Pugh; Bahram Namjou; Shubhabrata Mukherjee; Gail P Jarvik; Leah Claire Kottyan; Amber Burt; Yuki Bradford; Gretta D Armstrong; Kimberly Derr; Dana Crawford; Jonathan L Haines; Rongling Li; David Crosslin; Marylyn D Ritchie

doi:10.3389/fgene.2014.00370

Frontiers in Genetics (Dec 2014)

Affiliations

Shefali S Verma: The Pennsylvania State University
Mariza de Andrade: Mayo Clinic
Gerard Tromp: The Sigfried and Janet Weis Center for Research, Geisinger Health System
Helena Kuivaniemi: The Sigfried and Janet Weis Center for Research, Geisinger Health System
Elizabeth Pugh: Center for Inherited Disease Research, Johns Hopkins University
Bahram Namjou: Cincinnati Children’s Hospital Medical Center
Shubhabrata Mukherjee: University of Washington
Gail P Jarvik: University of Washington
Leah Claire Kottyan: Cincinnati Children’s Hospital Medical Center
Amber Burt: University of Washington
Yuki Bradford: Vanderbilt University
Gretta D Armstrong: The Pennsylvania State University
Kimberly Derr: The Sigfried and Janet Weis Center for Research, Geisinger Health System
Dana Crawford: Vanderbilt University
Jonathan L Haines: Case Western Reserve University
Rongling Li: Division of Genomic Medicine, National Human Genome Research Institute
David Crosslin: University of Washington
Marylyn D Ritchie: The Pennsylvania State University

Journal volume & issue: Vol. 5

Abstract

The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 52,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R² (estimated correlation between the imputed and true genotypes), and the relationship between allelic R² and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
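The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the eMERGE pipeline and not the estimator that BEAGLE or IMPUTE2 compute internally: those tools estimate the correlation without access to true genotypes, whereas this sketch assumes a set of masked genotypes whose true values are known. The function names, the 0/1/2 genotype coding, and the toy data are all assumptions made for illustration.

```python
def minor_allele_frequency(genotypes):
    """Minor allele frequency from genotypes coded as 0, 1, or 2 allele copies."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)


def squared_correlation(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages and true genotypes.

    This is an empirical stand-in for allelic R-squared, computed against
    known (masked) genotypes rather than estimated from posterior probabilities.
    """
    n = len(genotypes)
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        return float("nan")  # undefined for a monomorphic marker
    return cov * cov / (var_d * var_g)


def concordance(dosages, genotypes):
    """Fraction of samples whose best-guess genotype matches the true genotype."""
    # int(d + 0.5) rounds a dosage in [0, 2] to the nearest genotype, halves up.
    matches = sum(1 for d, g in zip(dosages, genotypes) if int(d + 0.5) == g)
    return matches / len(genotypes)


# Hypothetical toy data for one marker: true genotypes and imputed dosages.
true_genotypes = [0, 1, 2, 1, 0, 0, 1, 2]
imputed_dosages = [0.1, 0.9, 1.8, 1.2, 0.0, 0.3, 0.7, 2.0]

maf = minor_allele_frequency(true_genotypes)
r2 = squared_correlation(imputed_dosages, true_genotypes)
acc = concordance(imputed_dosages, true_genotypes)
print(f"MAF={maf:.4f}  R2={r2:.4f}  concordance={acc:.2f}")
```

Computing the (MAF, R²) pair for every marker and plotting or binning the pairs by MAF is one way to examine the relationship between allelic R² and minor allele frequency that the abstract mentions.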

Published in Frontiers in Genetics

ISSN: 1664-8021 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Science: Biology (General): Genetics
Website: http://journal.frontiersin.org/journal/genetics
