Scientific Reports (Jul 2024)

PowerNovo: de novo peptide sequencing via tandem mass spectrometry using an ensemble of transformer and BERT models

  • Denis V. Petrovskiy,
  • Kirill S. Nikolsky,
  • Liudmila I. Kulikova,
  • Vladimir R. Rudnev,
  • Tatiana V. Butkova,
  • Kristina A. Malsagova,
  • Arthur T. Kopylov,
  • Anna L. Kaysheva

DOI
https://doi.org/10.1038/s41598-024-65861-0
Journal volume & issue
Vol. 14, no. 1
pp. 1 – 15

Abstract

Read online

Abstract The primary objective of analyzing the data obtained in a mass spectrometry-based proteomic experiment is peptide and protein identification, or correct assignment of the tandem mass spectrum to one amino acid sequence. Comparison of empirical fragment spectra with the theoretical predicted one or matching with the collected spectra library are commonly accepted strategies of proteins identification and defining of their amino acid sequences. Although these approaches are widely used and are appreciably efficient for the well-characterized model organisms or measured proteins, they cannot detect novel peptide sequences that have not been previously annotated or are rare. This study presents PowerNovo tool for de novo sequencing of proteins using tandem mass spectra acquired in a variety of types of mass analyzers and different fragmentation techniques. PowerNovo involves an ensemble of models for peptide sequencing: model for detecting regularities in tandem mass spectra, precursors, and fragment ions and a natural language processing model, which has a function of peptide sequence quality assessment and helps with reconstruction of noisy sequences. The results of testing showed that the performance of PowerNovo is comparable and even better than widely utilized PointNovo, DeepNovo, Casanovo, and Novor packages. Also, PowerNovo provides complete cycle of processing (pipeline) of mass spectrometry data and, along with predicting the peptide sequence, involves the peptide assembly and protein inference blocks.