The GIAB genomic stratifications resource for human reference genomes

Nathan Dwarshuis; Divya Kalra; Jennifer McDaniel; Philippe Sanio; Pilar Alvarez Jerez; Bharati Jadhav; Wenyu (Eddy) Huang; Rajarshi Mondal; Ben Busby; Nathan D. Olson; Fritz J. Sedlazeck; Justin Wagner; Sina Majidian; Justin M. Zook

doi:10.1038/s41467-024-53260-y

Nature Communications (Oct 2024)

Affiliations

Nathan Dwarshuis: Material Measurement Laboratory, National Institute of Standards and Technology
Divya Kalra: Human Genome Sequencing Center, Baylor College of Medicine
Jennifer McDaniel: Material Measurement Laboratory, National Institute of Standards and Technology
Philippe Sanio: University of Applied Sciences Upper Austria - FH Hagenberg
Pilar Alvarez Jerez: Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health
Bharati Jadhav: Department of Genetics and Genomic Sciences and Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, Hess Center for Science and Medicine
Wenyu (Eddy) Huang: Department of Computer Science, College of Engineering, Rice University
Rajarshi Mondal: Department of Bioinformatics, Pondicherry University
Ben Busby: DNA Nexus
Nathan D. Olson: Material Measurement Laboratory, National Institute of Standards and Technology
Fritz J. Sedlazeck: Human Genome Sequencing Center, Baylor College of Medicine
Justin Wagner: Material Measurement Laboratory, National Institute of Standards and Technology
Sina Majidian: Department of Computational Biology, University of Lausanne
Justin M. Zook: Material Measurement Laboratory, National Institute of Standards and Technology

DOI: https://doi.org/10.1038/s41467-024-53260-y
Journal volume & issue: Vol. 15, no. 1, pp. 1–13

Abstract

Despite the growing variety of sequencing and variant-calling tools, no workflow performs equally well across the entire human genome. Understanding context-dependent performance is critical for enabling researchers, clinicians, and developers to make informed tradeoffs when selecting sequencing hardware and software. Here we describe a set of “stratifications,” which are BED files that define distinct contexts throughout the genome. We define these for GRCh37/38 as well as the new T2T-CHM13 reference, adding many new hard-to-sequence regions that are critical for understanding performance as the field progresses. Specifically, we highlight the increase in hard-to-map and GC-rich stratifications in CHM13 relative to the previous references. We then compare the benchmarking performance with each reference and show the performance penalty brought about by these additional difficult regions in CHM13. Additionally, we demonstrate how the stratifications can track context-specific improvements over different platform iterations, using Oxford Nanopore Technologies as an example. The means to generate these stratifications are available as a Snakemake pipeline at https://github.com/usnistgov/giab-stratifications. We anticipate this being useful in enabling precise risk-reward calculations when building sequencing pipelines for any of the commonly used reference genomes.
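To make the idea of a stratification concrete, the following is a minimal sketch of how a BED file could be used to compute recall within one genomic context. It is not the authors' pipeline or any benchmarking tool's implementation: the function names (`parse_bed`, `in_regions`, `recall_by_stratum`), the stratum name `hard`, and the toy coordinates and TP/FN labels are all hypothetical, and the only assumption taken from the format itself is that BED intervals are 0-based and half-open.

```python
from bisect import bisect_right
from collections import defaultdict


def parse_bed(lines):
    """Parse BED lines into {chrom: sorted, merged list of (start, end)}.

    Coordinates are 0-based and half-open, as in the BED format.
    """
    regions = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        regions[chrom].append((int(start), int(end)))

    merged = {}
    for chrom, ivs in regions.items():
        ivs.sort()
        out = [ivs[0]]
        for start, end in ivs[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:
                # Overlapping or touching intervals are merged into one.
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_regions(regions, chrom, pos):
    """Return True if the 0-based position lies inside any interval on chrom."""
    ivs = regions.get(chrom, [])
    # Index of the last interval whose start is <= pos (intervals are merged).
    i = bisect_right(ivs, (pos, float("inf"))) - 1
    return i >= 0 and ivs[i][0] <= pos < ivs[i][1]


def recall_by_stratum(strata, calls):
    """Compute recall = TP / (TP + FN) per stratum.

    strata: {name: regions as returned by parse_bed}
    calls:  iterable of (chrom, pos, label) with label "TP" or "FN"
    Returns None for a stratum that contains no labelled calls.
    """
    counts = {name: {"TP": 0, "FN": 0} for name in strata}
    for chrom, pos, label in calls:
        for name, regions in strata.items():
            if in_regions(regions, chrom, pos):
                counts[name][label] += 1
    result = {}
    for name, c in counts.items():
        total = c["TP"] + c["FN"]
        result[name] = c["TP"] / total if total else None
    return result


# Toy data (hypothetical): one stratum and four labelled benchmark variants.
toy_bed = ["chr1\t100\t200", "chr1\t150\t300", "chr2\t0\t50"]
toy_calls = [
    ("chr1", 120, "TP"),
    ("chr1", 250, "FN"),
    ("chr1", 500, "TP"),  # outside the stratum, so it is not counted
    ("chr2", 10, "TP"),
]
toy_recall = recall_by_stratum({"hard": parse_bed(toy_bed)}, toy_calls)
```

Treating each variant as a single position is a simplification made for this sketch; in practice, stratified metrics are typically produced by dedicated benchmarking software rather than hand-written interval lookups.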

Published in Nature Communications

ISSN: 2041-1723 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/ncomms/
