Frontiers in Virology (Apr 2024)

NeoRdRp2 with improved seed data, annotations, and scoring

  • Shoichi Sakaguchi,
  • Takashi Nakano,
  • So Nakagawa,
  • So Nakagawa,
  • So Nakagawa

DOI
https://doi.org/10.3389/fviro.2024.1378695
Journal volume & issue
Vol. 4

Abstract

Read online

RNA-dependent RNA polymerase (RdRp) is a marker gene for RNA viruses; thus, it is widely used to identify RNA viruses from metatranscriptome data. However, because of the high diversity of RdRp domains, it remains difficult to identify RNA viruses using RdRp sequences. To overcome this problem, we created a NeoRdRp database containing 1,182 hidden Markov model (HMM) profiles utilizing 12,502 RdRp domain sequences. Since the development of this database, more RNA viruses have been discovered, mainly through metatranscriptome sequencing analyses. To identify RNA viruses comprehensively and specifically, we updated the NeoRdRp by incorporating recently reported RNA viruses. To this end, 557,197 RdRp-containing sequences were used as seed RdRp datasets. These sequences were processed through deduplication, clustering, alignment, and splitting, thereby generating 19,394 HMM profiles. We validated the updated NeoRdRp database, using the UniProtKB dataset and found that the recall and specificity rates were improved to 99.4% and 81.6%, from 97.2% and 76.8% in the previous version, respectively. Comparisons of eight different RdRp search tools showed that NeoRdRp2 exhibited balanced RdRp and nonspecific detection power. Expansion of the annotated RdRp datasets is expected to further accelerate the discovery of novel RNA viruses from various transcriptome datasets. The HMM profiles of NeoRdRp2 and their annotations are available at https://github.com/shoichisakaguchi/NeoRdRp.

Keywords