Animal Research and One Health (Aug 2023)
The efficient phasing and imputation pipeline of low‐coverage whole genome sequencing data using a high‐quality and publicly available reference panel in cattle
Abstract
Low‐coverage whole genome sequencing (lcWGS) has great potential to effectively genotype large‐scale populations and to provide solid data for imputation; however, the time required for imputation needs to be optimized. There is also no publicly available reference panel for whole genome selection in cattle. Here, we proposed a combination of Beagle v5.4 for phasing and GLIMPSE2 for imputation, which is fast and accurate for cattle lcWGS data. Furthermore, we established a multi‐breed reference panel with 61.8 million SNPs based on 2976 worldwide cattle, of which 1766 were bulls, by evaluating the diversity and size of the reference panel. Imputation accuracy was evaluated using the new reference panel for both lcWGS and Bovine BeadChip data. In Holstein, the average concordance rate was 99.6%, 99.6%, and 99.5% for 1X, 0.5X, and 0.1X lcWGS data, respectively, and 99.5% and 99.0% for 777K and 50K chip data, respectively; in Simmental, it was 98.8% for 1X lcWGS data. We further investigated the factors affecting the imputation accuracy of lcWGS data and found that segmental duplications, structural variants, and guanine‐cytosine content were the top three factors. Interestingly, we found 10 regions longer than 0.5 Mb that showed low imputation accuracy and were enriched for immune function (for example, 96.1% of the characterized genes in the regions on chromosome 10), so downstream immune‐related analyses warrant more attention. Our study provides a workflow for imputing lcWGS data and establishes the first freely accessible, high‐quality cattle reference panel, a resource for subsequent large‐scale genome‐wide association studies and genomic selection.
Keywords