G3: Genes, Genomes, Genetics (Aug 2020)

Closing Human Reference Genome Gaps: Identifying and Characterizing Gap-Closing Sequences

  • Tingting Zhao,
  • Zhongqu Duan,
  • Georgi Z. Genchev,
  • Hui Lu

DOI
https://doi.org/10.1534/g3.120.401280
Journal volume & issue
Vol. 10, no. 8
pp. 2801 – 2809

Abstract

Despite continuous updates of the human reference genome, there are still hundreds of unresolved gaps, which account for about 5% of the total sequence length. Given the availability of whole genome de novo assemblies, especially those derived from long-read sequencing data, gap-closing sequences can be determined. By comparing 17 de novo long-read sequencing assemblies with the human reference genome, we identified a total of 1,125 gap-closing sequences for 132 (16.9% of 783) gaps and added up to 2.2 Mb of novel sequence to the human reference genome. More than 90% of the non-redundant sequences could be verified by unmapped reads from the Simons Genome Diversity Project dataset. In addition, 15.6% of the non-reference sequences were found in at least one of four non-human primate genomes. We further demonstrated that the non-redundant sequences had a high content of simple repeats and satellite sequences. Moreover, 43 (32.6%) of the 132 closed gaps were shown to be polymorphic; such sequences may play an important biological role and can be useful in the investigation of human genetic diversity.
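
The proportions quoted in the abstract can be re-derived from the raw counts it gives. The short Python sketch below is only an arithmetic check, not part of the authors' pipeline; the constant names and the `percent` helper are illustrative assumptions introduced here.

```python
# Counts as reported in the abstract (names are illustrative, not from the paper).
TOTAL_GAPS = 783        # gaps considered in the human reference genome
CLOSED_GAPS = 132       # gaps with at least one gap-closing sequence
POLYMORPHIC_GAPS = 43   # closed gaps shown to be polymorphic


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


closed_pct = percent(CLOSED_GAPS, TOTAL_GAPS)
polymorphic_pct = percent(POLYMORPHIC_GAPS, CLOSED_GAPS)

print(f"Closed gaps: {CLOSED_GAPS}/{TOTAL_GAPS} = {closed_pct}%")
print(f"Polymorphic among closed: {POLYMORPHIC_GAPS}/{CLOSED_GAPS} = {polymorphic_pct}%")
```

Both values agree with the figures stated in the abstract (16.9% and 32.6%).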

Keywords