Methods in Ecology and Evolution (Oct 2021)

Unbiased population heterozygosity estimates from genome‐wide sequence data

  • Thomas L. Schmidt,
  • Moshe‐Elijah Jasper,
  • Andrew R. Weeks,
  • Ary A. Hoffmann

DOI
https://doi.org/10.1111/2041-210X.13659
Journal volume & issue
Vol. 12, no. 10
pp. 1888 – 1898

Abstract

Heterozygosity is a metric of genetic variability frequently used to inform the management of threatened taxa. Estimating observed and expected heterozygosities from genome‐wide sequence data has become increasingly common, and these estimates are often derived directly from genotypes at single nucleotide polymorphism (SNP) markers. While many SNP markers can provide precise estimates of genetic processes, the results of ‘downstream’ analysis with these markers may depend heavily on ‘upstream’ filtering decisions. Here we explore the downstream consequences of sample size, rare allele filtering, missing data thresholds and known population structure on estimates of observed and expected heterozygosity using two reduced‐representation sequencing datasets, one from the mosquito Aedes aegypti (ddRADseq) and the other from a threatened grasshopper, Keyacris scurra (DArTseq). We show that estimates based on polymorphic markers only (i.e. SNP heterozygosity) are always biased by global sample size (N), with smaller N producing larger estimates. By contrast, results are unbiased by sample size when calculations consider monomorphic as well as polymorphic sequence information (i.e. genome‐wide or autosomal heterozygosity). SNP heterozygosity is also biased when differentiated populations are analysed together while autosomal heterozygosity remains unbiased. We also show that when nucleotide sites with missing genotypes are included, observed and expected heterozygosity estimates diverge in proportion to the amount of missing data permitted at each site. We make three recommendations for estimating genome‐wide heterozygosity: (a) autosomal heterozygosity should be reported instead of (or in addition to) SNP heterozygosity; (b) sites with any missing data should be omitted and (c) populations should be analysed in independent runs. This should facilitate comparisons within and across studies and between observed and expected measures of heterozygosity.
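
The distinction between SNP heterozygosity and autosomal heterozygosity can be illustrated with a minimal sketch. This is not the authors' pipeline: the genotype encoding (0, 1, 2 copies of the alternate allele), the missing-data code of -1, the toy matrix and all function names are assumptions made for illustration, and expected heterozygosity is computed as a plain 2p(1 - p) with no small-sample correction. The sketch first drops sites with any missing genotype, then averages heterozygosity either over all retained sites (monomorphic and polymorphic) or over polymorphic sites only.

```python
import numpy as np

MISSING = -1  # hypothetical code for a missing genotype call


def drop_sites_with_missing(genotypes):
    """Keep only sites (columns) genotyped in every individual."""
    complete = (genotypes != MISSING).all(axis=0)
    return genotypes[:, complete]


def polymorphic_sites(genotypes):
    """Keep only sites where both alleles are present in this sample."""
    alt_freq = genotypes.mean(axis=0) / 2.0
    return genotypes[:, (alt_freq > 0) & (alt_freq < 1)]


def observed_het(genotypes):
    """Proportion of heterozygous genotypes across the sites given."""
    return float((genotypes == 1).mean())


def expected_het(genotypes):
    """Mean of 2p(1 - p) across the sites given (no small-sample correction)."""
    p = genotypes.mean(axis=0) / 2.0
    return float((2.0 * p * (1.0 - p)).mean())


# Toy data: rows are individuals, columns are nucleotide sites.
# Values are alternate-allele counts; the last site has one missing call.
genotypes = np.array([
    [0, 1, 0, 0, 2, -1],
    [0, 1, 0, 0, 2, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 2, 0, 0, 2, 0],
])

complete = drop_sites_with_missing(genotypes)  # all sites without missing data
snps = polymorphic_sites(complete)             # polymorphic subset only

autosomal_ho = observed_het(complete)
autosomal_he = expected_het(complete)
snp_ho = observed_het(snps)
snp_he = expected_het(snps)

print(f"autosomal Ho={autosomal_ho:.3f}  He={autosomal_he:.3f}")
print(f"SNP       Ho={snp_ho:.3f}  He={snp_he:.3f}")
```

In this toy case the SNP-only values are larger than the autosomal values simply because the denominator excludes monomorphic sites. Which sites count as polymorphic depends on the individuals sampled, so the SNP-only denominator changes with N, whereas the all-sites denominator does not.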

Keywords