MetageNN: a memory-efficient neural network taxonomic classifier robust to sequencing errors and missing genomes

Rafael Peres da Silva; Chayaporn Suphavilai; Niranjan Nagarajan

doi:10.1186/s12859-024-05760-3

BMC Bioinformatics (Apr 2024)

Affiliations

Rafael Peres da Silva: School of Computing, National University of Singapore
Chayaporn Suphavilai: Agency for Science, Technology and Research (A*STAR), Genome Institute of Singapore (GIS)
Niranjan Nagarajan: School of Computing, National University of Singapore

DOI: https://doi.org/10.1186/s12859-024-05760-3
Journal volume & issue: Vol. 25, no. S1
pp. 1 – 19

Abstract

Background: With the rapid increase in throughput of long-read sequencing technologies, recent studies have explored their potential for taxonomic classification by using alignment-based approaches to reduce the impact of higher sequencing error rates. While alignment-based methods are generally slower, k-mer-based taxonomic classifiers can overcome this limitation, potentially at the expense of lower sensitivity for strains and species that are not in the database.

Results: We present MetageNN, a memory-efficient long-read taxonomic classifier that is robust to sequencing errors and missing genomes. MetageNN is a neural network model that uses short k-mer profiles of sequences to reduce the impact of distribution shifts on error-prone long reads. Benchmarking MetageNN against another machine learning approach for taxonomic classification (GeNet) showed substantial improvements with long-read data (20% improvement in F1 score). On nanopore sequencing data, MetageNN exhibits improved sensitivity in situations where the reference database is incomplete. It surpasses the alignment-based tools MetaMaps and MEGAN-LR, as well as the k-mer-based tool Kraken2, with improvements of 100%, 36%, and 23%, respectively, in read-level analysis. Notably, at the community level, MetageNN consistently demonstrated higher sensitivities than the previously mentioned tools. Furthermore, MetageNN is 7× faster than MetaMaps and GeNet and > 2× faster than MEGAN-LR and MMseqs2.

Conclusion: This proof-of-concept work demonstrates the utility of machine-learning-based methods for taxonomic classification using long reads. MetageNN can be used on sequences not classified by conventional methods and offers an alternative approach for memory-efficient classifiers that can be optimized further.
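
As an illustration of the "short k-mer profiles" mentioned in the abstract, the following is a minimal Python sketch of how a normalized k-mer frequency vector could be computed from a read. It is not the authors' implementation: the function name `kmer_profile`, the default `k=4`, the restriction to the ACGT alphabet, and the choice to skip windows containing ambiguous bases are all assumptions made for this example.

```python
from itertools import product


def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    Hypothetical helper for illustration; k=4 is an assumed default,
    not a value taken from the paper.
    """
    # Enumerate all 4**k possible k-mers in a fixed (lexicographic) order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    counts = [0] * len(kmers)
    seq = seq.upper()
    # Slide a window of length k across the sequence.
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip windows with ambiguous bases such as N
            counts[j] += 1

    total = sum(counts)
    # Normalize so the profile does not depend on read length.
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


if __name__ == "__main__":
    profile = kmer_profile("ACGTACGTTTGACCA", k=2)
    print(len(profile), round(sum(profile), 6))
```

A fixed-length vector of this kind could serve as input to a neural network regardless of read length. Because each sequencing error changes only the few k-mers that overlap it, short k-mer profiles would be expected to shift less under error-prone long reads than long exact k-mer matches.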

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
