A big data approach to metagenomics for all-food-sequencing

Robin Kobus; José M. Abuín; André Müller; Sören Lukas Hellmann; Juan C. Pichel; Tomás F. Pena; Andreas Hildebrandt; Thomas Hankeln; Bertil Schmidt

doi:10.1186/s12859-020-3429-6

BMC Bioinformatics (Mar 2020)

Affiliations

Robin Kobus: Department of Computer Science, Johannes Gutenberg University
José M. Abuín: IPCA, Polytechnic Institute of Cávado and Ave
André Müller: Department of Computer Science, Johannes Gutenberg University
Sören Lukas Hellmann: Molecular Genetics and Genome Analysis, Institute of Organismal and Molecular Evolution, Johannes Gutenberg University
Juan C. Pichel: CiTIUS, Universidade de Santiago de Compostela
Tomás F. Pena: CiTIUS, Universidade de Santiago de Compostela
Andreas Hildebrandt: Department of Computer Science, Johannes Gutenberg University
Thomas Hankeln: Molecular Genetics and Genome Analysis, Institute of Organismal and Molecular Evolution, Johannes Gutenberg University
Bertil Schmidt: Department of Computer Science, Johannes Gutenberg University

DOI: https://doi.org/10.1186/s12859-020-3429-6
Journal volume & issue: Vol. 21, no. 1, pp. 1–15

Abstract

Background: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches.

Results: We introduce an alignment-free k-mer based method for detection and quantification of species composition in food and other complex biological matters. It is orders-of-magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the usage of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark).

Conclusions: We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters).
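
As a rough illustration of the general idea behind alignment-free k-mer based detection and quantification mentioned in the abstract, the following minimal Python sketch indexes the k-mers of a few toy reference sequences, assigns each read to the reference it shares the most k-mers with, and reports relative read proportions. This is not the authors' implementation and makes no claim about how AFS-MetaCache or MetaCacheSpark work internally; the function names (`kmers`, `build_index`, `classify`, `quantify`), the `min_hits` threshold, and the toy sequences are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of reference names that contain it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def classify(read, index, k, min_hits=2):
    """Return the reference sharing the most k-mers with the read.

    Returns None when there are fewer than min_hits matches (hypothetical
    threshold) or when the top two references are tied.
    """
    hits = Counter()
    for km in kmers(read, k):
        for name in index.get(km, ()):
            hits[name] += 1
    if not hits:
        return None
    (best, n), *rest = hits.most_common(2)
    if n < min_hits or (rest and rest[0][1] == n):
        return None
    return best


def quantify(reads, index, k):
    """Return the fraction of classified reads assigned to each reference."""
    counts = Counter(classify(r, index, k) for r in reads)
    counts.pop(None, None)  # ignore unclassified reads
    total = sum(counts.values())
    return {name: c / total for name, c in counts.items()} if total else {}


# Toy data (hypothetical sequences, not real genomes).
references = {"cow": "ACGTACGTTGCA", "pig": "TTGGCCAATTGG"}
index = build_index(references, k=4)
reads = ["ACGTACG", "GGCCAAT", "CGTTGCA", "AAAAAAA"]
proportions = quantify(reads, index, k=4)
```

In this toy run, two of the three classifiable reads match the hypothetical "cow" reference and one matches "pig", while the last read shares no k-mers with either reference and is left unclassified. A plain dictionary index like this would not scale to large eukaryotic genomes, which is the kind of memory pressure the database partitioning described in the abstract is meant to address.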

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
