The Plant Genome (Nov 2019)

A Systematic Gene‐Centric Approach to Define Haplotypes and Identify Alleles on the Basis of Dense Single Nucleotide Polymorphism Datasets

  • Aurélie Tardivel,
  • Davoud Torkamaneh,
  • Marc‐André Lemay,
  • François Belzile,
  • Louise S. O'Donoughue

DOI
https://doi.org/10.3835/plantgenome2018.08.0061
Journal volume & issue
Vol. 12, no. 3

Abstract

Core Ideas

  • A gene‐centric approach for haplotype definition was developed and implemented in R.
  • The tool allows for allelic characterization at given loci in germplasm collections.
  • Allelic status at four maturity genes is predicted on the basis of marker genotyping data.

Assessing the allelic diversity within a germplasm collection and identifying individuals carrying favorable alleles is challenging. Advances in high‐throughput technologies allow the genotyping of many individuals for thousands of markers, but bridging the gap between single nucleotide polymorphisms (SNPs) and relevant alleles remains difficult. We developed a systematic approach that defines haplotypes from large SNP catalogs, with the aim of identifying haplotypes that can be equated to alleles at given genes. Unlike haplotype visualization tools, our approach selects SNP markers that flank a gene and defines haplotypes that correspond to this gene's alleles. We tested this approach on four known soybean [Glycine max (L.) Merr.] maturity genes (E1, GmGia, GmPhyA3, and GmPhyA2) in a collection of 67 lines, using two genotypic datasets [a SNP array and genotyping‐by‐sequencing (GBS)]. For E1, GmGia, and GmPhyA3, we identified SNP haplotypes such that the allele found at these genes could be accurately predicted from the haplotype in 97.3% of the cases. For these genes, of the 12 known alleles in the collection, 10 and 8 could be correctly predicted from the haplotypes found with the SNP array and GBS datasets, respectively, with success rates of 98 and 97% across all allele–line combinations. The approach thus proved equally successful for data derived from a SNP array and from GBS. However, in the case of GmPhyA2, a lack of markers in the genomic region prevented the identification of alleles, regardless of the dataset. We demonstrate the feasibility and reproducibility of our approach and identify limits to its applicability.
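The general idea described in the abstract (select SNP markers flanking a gene, define haplotypes from them, and check how well each haplotype predicts the known allele) can be illustrated with a minimal Python sketch. This is not the authors' R implementation. The selection rule (the nearest `n_each_side` markers on each side of the gene), the function names, the marker positions, the line names, and the allele labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def flanking_markers(snp_positions, gene_start, gene_end, n_each_side=2):
    """Return the n closest markers on each side of the gene (hypothetical rule)."""
    upstream = sorted(p for p in snp_positions if p < gene_start)[-n_each_side:]
    downstream = sorted(p for p in snp_positions if p > gene_end)[:n_each_side]
    return upstream + downstream


def haplotypes(genotypes, markers):
    """Concatenate each line's calls at the selected markers into a haplotype string."""
    return {line: "".join(calls[m] for m in markers)
            for line, calls in genotypes.items()}


def prediction_accuracy(hap_by_line, allele_by_line):
    """Map each haplotype to its most frequent allele and return the
    fraction of lines whose allele matches that majority assignment."""
    alleles_per_hap = defaultdict(Counter)
    for line, hap in hap_by_line.items():
        alleles_per_hap[hap][allele_by_line[line]] += 1
    correct = sum(c.most_common(1)[0][1] for c in alleles_per_hap.values())
    return correct / len(hap_by_line)


# Toy data (invented): a gene spanning positions 250-350 and six SNP markers.
positions = [100, 200, 300, 400, 500, 900]
genotypes = {
    "line_A": {100: "A", 200: "C", 300: "G", 400: "T", 500: "A", 900: "G"},
    "line_B": {100: "A", 200: "C", 300: "G", 400: "T", 500: "A", 900: "C"},
    "line_C": {100: "G", 200: "C", 300: "T", 400: "T", 500: "G", 900: "G"},
    "line_D": {100: "G", 200: "C", 300: "T", 400: "T", 500: "G", 900: "C"},
}
known_alleles = {
    "line_A": "allele_1",
    "line_B": "allele_1",
    "line_C": "allele_2",
    "line_D": "allele_1",  # shares line_C's haplotype but carries another allele
}

markers = flanking_markers(positions, gene_start=250, gene_end=350)
haps = haplotypes(genotypes, markers)
accuracy = prediction_accuracy(haps, known_alleles)
print(markers, haps, accuracy)
```

In this toy case one haplotype is shared by lines carrying different alleles, so the flanking markers cannot separate them. This loosely mirrors the limitation the abstract reports for GmPhyA2, where too few markers in the region prevented allele identification.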