Computational and Structural Biotechnology Journal (Jan 2021)

A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research

  • Chang Lu
  • Bastian Greshake Tzovaras
  • Julian Gough

Journal volume & issue
Vol. 19
pp. 3747 – 3754

Abstract

Two major forces have driven the rapid growth of human genetic data: medical research supported by governments and academic institutes, and direct-to-consumer (DTC) sequencing companies. While data from the former benefits from meticulously designed sequencing standards and quality control procedures, data from the latter comes in various formats and from various sequencing methods, which change over time and with the particular needs of different companies. Thanks to members of the general public who shared their DNA data without constraint, we provide here a review of over 7000 genomes made public between 2011 and 2020 and produced by more than six DTC sequencing companies. We also provide an open-source toolkit, GenomePrep, that systematically parses, quality-checks and filters genome files and statistically problematic alleles, in order to prepare consumer DNA datasets for research. The GenomePrep output is available in two common DNA data file formats to enable further analysis with other tools. In addition, the combined output for all OpenSNP array genomes processed in this paper is available for download as a single data-freeze file.

Keywords