A machine learning-based service for estimating quality of genomes using PATRIC

Bruce Parrello; Rory Butler; Philippe Chlenski; Robert Olson; Jamie Overbeek; Gordon D. Pusch; Veronika Vonstein; Ross Overbeek

doi:10.1186/s12859-019-3068-y

BMC Bioinformatics (Oct 2019)

Affiliations

Bruce Parrello: Fellowship for Interpretation of Genomes
Rory Butler: Computing, Environment, and Life Sciences Directorate, Argonne National Laboratory
Philippe Chlenski: Fellowship for Interpretation of Genomes
Robert Olson: Computing, Environment, and Life Sciences Directorate, Argonne National Laboratory
Jamie Overbeek: Fellowship for Interpretation of Genomes
Gordon D. Pusch: Fellowship for Interpretation of Genomes
Veronika Vonstein: Fellowship for Interpretation of Genomes
Ross Overbeek: Fellowship for Interpretation of Genomes

DOI: https://doi.org/10.1186/s12859-019-3068-y
Journal volume & issue: Vol. 20, no. 1, pp. 1–9

Abstract

Background: Recent advances in high-volume sequencing technology and mining of genomes from metagenomic samples call for rapid and reliable genome quality evaluation. The current release of the PATRIC database contains over 220,000 genomes, and current metagenomic technology supports assemblies of many draft-quality genomes from a single sample, most of which will be novel.

Description: We have added two quality assessment tools to the PATRIC annotation pipeline. EvalCon uses supervised machine learning to calculate an annotation consistency score. EvalG implements a variant of the CheckM algorithm to estimate contamination and completeness of an annotated genome. We report on the performance of these tools and the potential utility of the consistency score. Additionally, we provide contamination, completeness, and consistency measures for all genomes in PATRIC and in a recent set of metagenomic assemblies.

Conclusion: EvalG and EvalCon facilitate the rapid quality control and exploration of PATRIC-annotated draft genomes.
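
As a rough illustration of what marker-based completeness and contamination estimates look like, the sketch below counts how many expected single-copy marker roles are present in a genome and how many surplus copies appear. This is a deliberately simplified assumption, not the EvalG implementation and not the full CheckM algorithm; the function name `estimate_quality`, its parameters, and the role labels are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def estimate_quality(
    marker_roles: Iterable[str],
    observed_roles: Iterable[str],
) -> Tuple[float, float]:
    """Return (completeness, contamination) as percentages.

    Simplified, hypothetical scoring (an assumption, not EvalG itself):
      completeness  = share of expected marker roles seen at least once
      contamination = surplus marker copies relative to the marker count
    """
    markers = set(marker_roles)
    if not markers:
        raise ValueError("marker_roles must not be empty")

    # Count only the roles that belong to the expected marker set.
    counts = Counter(role for role in observed_roles if role in markers)

    # Markers that occur at least once contribute to completeness.
    present = sum(1 for m in markers if counts[m] >= 1)
    # Each copy beyond the first is treated as a sign of contamination.
    extra = sum(counts[m] - 1 for m in markers if counts[m] > 1)

    completeness = 100.0 * present / len(markers)
    contamination = 100.0 * extra / len(markers)
    return completeness, contamination
```

For example, with four expected marker roles of which two are found and one of those occurs twice, `estimate_quality(["A", "B", "C", "D"], ["A", "B", "B", "X"])` returns `(50.0, 25.0)`.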

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
