mSystems (Oct 2024)

Semantic search using protein large language models detects class II microcins in bacterial genomes

  • Anastasiya V. Kulikova,
  • Jennifer K. Parker,
  • Bryan W. Davies,
  • Claus O. Wilke

DOI
https://doi.org/10.1128/msystems.01044-24
Journal volume & issue
Vol. 9, no. 10

Abstract

ABSTRACT Class II microcins are antimicrobial peptides that have shown some potential as novel antibiotics. However, to date, only 10 class II microcins have been described, and the discovery of novel microcins has been hampered by their short length and high sequence divergence. Here, we ask if we can use numerical embeddings generated by protein large language models to detect microcins in bacterial genome assemblies and whether this method can outperform sequence-based methods such as BLAST. We find that embeddings detect known class II microcins much more reliably than does BLAST and that any two microcins tend to have a small distance in embedding space even though they typically are highly diverged at the sequence level. In data sets of Escherichia coli, Klebsiella spp., and Enterobacter spp. genomes, we further find novel putative microcins that were previously missed by sequence-based search methods.

IMPORTANCE Antibiotic resistance is becoming an increasingly serious problem in modern medicine, but the development pipeline for conventional antibiotics is not promising. Therefore, alternative approaches to combat bacterial infections are urgently needed. One such approach may be to employ naturally occurring antibacterial peptides produced by bacteria to kill competing bacteria. A promising class of such peptides are class II microcins. However, only a small number of class II microcins have been discovered to date, and the discovery of further such microcins has been hampered by their high sequence divergence and short length, which can cause sequence-based search methods to fail. Here, we demonstrate that a more robust method for microcin discovery can be built on the basis of a protein large language model, and we use this method to identify several putative novel class II microcins.
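The idea of searching by distance in embedding space can be illustrated with a minimal sketch. The abstract does not state which language model, distance metric, or threshold the authors used, so everything below is an assumption made for illustration: the toy 3-dimensional vectors stand in for real protein embeddings, Euclidean distance is one possible metric, and the names `euclidean`, `rank_candidates`, `known`, and `candidates` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_candidates(known, candidates):
    """Rank candidates by distance to their nearest known embedding.

    Returns a list of (name, distance) pairs, closest first.
    """
    scored = []
    for name, vec in candidates.items():
        nearest = min(euclidean(vec, ref) for ref in known.values())
        scored.append((name, nearest))
    return sorted(scored, key=lambda item: item[1])

# Toy vectors standing in for embeddings of known microcins (hypothetical data).
known = {
    "known_A": [0.9, 0.1, 0.0],
    "known_B": [0.8, 0.2, 0.1],
}

# Toy vectors standing in for embeddings of open reading frames from a genome.
candidates = {
    "orf_1": [0.85, 0.15, 0.05],
    "orf_2": [0.0, 0.9, 0.4],
    "orf_3": [0.7, 0.3, 0.2],
}

ranking = rank_candidates(known, candidates)
for name, dist in ranking:
    print(f"{name}\t{dist:.3f}")
```

In this toy setup, candidates that lie close to any known microcin rank first and would be the ones inspected further. A real search would replace the toy vectors with embeddings produced by a protein language model and would need a distance cutoff chosen against known positives and negatives.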

Keywords