A Distributed Whole Genome Sequencing Benchmark Study

Richard D. Corbett; Robert Eveleigh; Joe Whitney; Namrata Barai; Mathieu Bourgey; Eric Chuah; Joanne Johnson; Richard A. Moore; Neda Moradin; Karen L. Mungall; Sergio Pereira; Miriam S. Reuter; Bhooma Thiruvahindrapuram; Richard F. Wintle; Jiannis Ragoussis; Lisa J. Strug; Jo-Anne Herbrick; Naveed Aziz; Steven J. M. Jones; Mark Lathrop; Stephen W. Scherer; Alfredo Staffa; Andrew J. Mungall

doi:10.3389/fgene.2020.612515

Frontiers in Genetics (Dec 2020)

Affiliations

Richard D. Corbett: Canada’s Michael Smith Genome Sciences Centre, BC Cancer Research Institute, Provincial Health Services Authority, Vancouver, BC, Canada
Robert Eveleigh: McGill Genome Centre, McGill University, Montreal, QC, Canada
Joe Whitney: The Centre for Applied Genomics, The Hospital for Sick Children and University of Toronto, Toronto, ON, Canada
Namrata Barai: The Centre for Applied Genomics, The Hospital for Sick Children and University of Toronto, Toronto, ON, Canada
Mathieu Bourgey: McGill Genome Centre, McGill University, Montreal, QC, Canada
Eric Chuah: Canada’s Michael Smith Genome Sciences Centre, BC Cancer Research Institute, Provincial Health Services Authority, Vancouver, BC, Canada
Joanne Johnson: Canada’s Michael Smith Genome Sciences Centre, BC Cancer Research Institute, Provincial Health Services Authority, Vancouver, BC, Canada
Richard A. Moore: Canada’s Michael Smith Genome Sciences Centre, BC Cancer Research Institute, Provincial Health Services Authority, Vancouver, BC, Canada
Neda Moradin: The Centre for Applied Genomics, The Hospital for Sick Children and University of Toronto, Toronto, ON, Canada
Karen L. Mungall: Canada’s Michael Smith Genome Sciences Centre, BC Cancer Research Institute, Provincial Health Services Authority, Vancouver, BC, Canada
Sergio Pereira: The Centre for Applied Genomics, The Hospital for Sick Children and University of Toronto, Toronto, ON, Canada
Miriam S. Reuter: Canada’s Genomics Enterprise (CGEn), The Hospital for Sick Children, Toronto, ON, Canada
Bhooma Thiruvahindrapuram: The Centre for Applied Genomics, The Hospital for Sick Children and University of Toronto, Toronto, ON, Canada
Richard F. Wintle: The Centre for Applied Genomics, The Hospital for Sick Children and University of Toronto, Toronto, ON, Canada
Jiannis Ragoussis: McGill Genome Centre, McGill University, Montreal, QC, Canada
Lisa J. Strug: The Centre for Applied Genomics, The Hospital for Sick Children and University of Toronto, Toronto, ON, Canada
Jo-Anne Herbrick: The Centre for Applied Genomics, The Hospital for Sick Children and University of Toronto, Toronto, ON, Canada
Naveed Aziz: Canada’s Genomics Enterprise (CGEn), The Hospital for Sick Children, Toronto, ON, Canada
Steven J. M. Jones: Canada’s Michael Smith Genome Sciences Centre, BC Cancer Research Institute, Provincial Health Services Authority, Vancouver, BC, Canada
Mark Lathrop: McGill Genome Centre, McGill University, Montreal, QC, Canada
Stephen W. Scherer: The Centre for Applied Genomics, The Hospital for Sick Children and University of Toronto, Toronto, ON, Canada
Alfredo Staffa: McGill Genome Centre, McGill University, Montreal, QC, Canada
Andrew J. Mungall: Canada’s Michael Smith Genome Sciences Centre, BC Cancer Research Institute, Provincial Health Services Authority, Vancouver, BC, Canada

Journal volume & issue: Vol. 11

Abstract

Population sequencing often requires collaboration across a distributed network of sequencing centers for the timely processing of thousands of samples. In such massive efforts, it is important that participating scientists can be confident that the accuracy of the sequence data produced is not affected by which center generates the data. A study was conducted across three established sequencing centers, located in Montreal, Toronto, and Vancouver, constituting Canada’s Genomics Enterprise (www.cgen.ca). Whole genome sequencing was performed at each center on three genomic DNA replicates from each of three well-characterized cell lines. The secondary analysis pipeline employed by each site was applied to the sequence data from every site, giving three conditions for each of four variables (cell line, replicate, sequencing center, and analysis pipeline), for a total of 3 × 3 × 3 × 3 = 81 datasets. Each dataset was assessed against multiple quality metrics, including concordance with benchmark variant truth sets, to check that quality was consistent across all three conditions of each variable. Three-way concordance analysis of variants across conditions for each variable was also performed. Our results showed that, when the same analysis pipeline was used, the variant concordance between datasets differing only by sequencing center was similar to the concordance between datasets differing only by replicate. We also showed that the statistically significant differences between datasets result from the analysis pipeline used, which can be unified and updated as new approaches become available. We conclude that genome sequencing projects can rely on the quality and reproducibility of aggregate data generated across a network of distributed sites.
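The four-variable design and the idea of a three-way concordance can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the cell line, replicate, and pipeline labels are placeholders, the variant tuples are toy data, and the concordance measure (variants shared by all three call sets divided by variants seen in any of them) is one simple choice, not necessarily the metric used in the study.

```python
from itertools import product

# Placeholder labels (hypothetical); the abstract names only the three cities.
FACTORS = {
    "cell_line": ["cell_line_1", "cell_line_2", "cell_line_3"],
    "replicate": ["rep_1", "rep_2", "rep_3"],
    "center": ["Montreal", "Toronto", "Vancouver"],
    "pipeline": ["pipeline_1", "pipeline_2", "pipeline_3"],
}
FACTOR_NAMES = tuple(FACTORS)


def enumerate_datasets():
    """Return every combination of the four variables (3^4 tuples)."""
    return list(product(*FACTORS.values()))


def groups_varying(factor):
    """Group datasets that differ only in `factor`; each group has three members."""
    idx = FACTOR_NAMES.index(factor)
    groups = {}
    for dataset in enumerate_datasets():
        key = dataset[:idx] + dataset[idx + 1:]  # the three fixed variables
        groups.setdefault(key, []).append(dataset)
    return groups


def three_way_concordance(a, b, c):
    """Fraction of variants found in all three call sets, out of those in any.

    Assumed definition for illustration; an empty union is treated as 1.0.
    """
    union = a | b | c
    if not union:
        return 1.0
    return len(a & b & c) / len(union)


if __name__ == "__main__":
    print(len(enumerate_datasets()))        # 81
    print(len(groups_varying("center")))    # 27 triples differing only by center

    # Toy variant calls as (chromosome, position, ref, alt) tuples.
    calls_montreal = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
    calls_toronto = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr3", 75, "T", "C")}
    calls_vancouver = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
    print(three_way_concordance(calls_montreal, calls_toronto, calls_vancouver))  # 0.5
```

Grouping by the three fixed variables is what makes a comparison such as "differing only by sequencing center" well defined: each of the 27 groups holds exactly one dataset per center, so any disagreement within a group can be attributed to that variable alone.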

Published in Frontiers in Genetics

ISSN: 1664-8021 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Science: Biology (General): Genetics
Website: http://journal.frontiersin.org/journal/genetics
