Protein embedding based alignment

Benjamin Giovanni Iovino; Yuzhen Ye

doi:10.1186/s12859-024-05699-5

BMC Bioinformatics (Feb 2024)

Affiliations

Benjamin Giovanni Iovino: Luddy School of Informatics, Computing and Engineering, Indiana University
Yuzhen Ye: Luddy School of Informatics, Computing and Engineering, Indiana University

DOI: https://doi.org/10.1186/s12859-024-05699-5
Journal volume & issue: Vol. 25, no. 1, pp. 1–16

Abstract


Purpose: Despite considerable progress in alignment algorithms, aligning divergent protein sequences with less than 20–35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have relied on substitution matrices since their introduction in the 1970s; however, these matrices do not score alignments well within the twilight zone. We developed Protein Embedding based Alignments (PEbA) to better align sequences with low pairwise identity. Like the traditional Smith-Waterman algorithm, PEbA uses dynamic programming, but the match score for a pair of amino acids is based on the similarity of their embeddings from a protein language model.

Methods: We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each extracted from one of its multiple sequence alignments. Five BAliBASE reference sets were used, differing in sequence identity, motifs, and length, to show how well PEbA aligns under different conditions.

Results: PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments, with the size of the improvement in alignment quality depending on the level of sequence similarity (over four times as well for pairs of sequences with <10% identity). We also compared PEbA using embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed protein language model embedding-based alignment methods.

Conclusion: Our results suggest that general-purpose protein language models provide useful contextual information for generating more accurate protein alignments than typically used methods.
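
The core idea in the abstract, Smith-Waterman-style dynamic programming in which the match score comes from embedding similarity rather than a substitution matrix, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the similarity measure, the gap model, or any parameter values, so cosine similarity, a linear gap penalty, and the names `local_align`, `gap`, `scale`, and `shift` are all assumptions made for illustration. Toy one-hot vectors stand in for the per-residue embeddings that a protein language model would produce.

```python
import numpy as np


def cosine_similarity_matrix(emb_a, emb_b):
    """Cosine similarity between every row of emb_a (n x d) and emb_b (m x d)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return a @ b.T


def local_align(emb_a, emb_b, gap=-2.0, scale=2.0, shift=0.5):
    """Smith-Waterman-style local alignment scored by embedding similarity.

    Returns (best_score, pairs), where pairs lists aligned (i, j) residue indices.
    gap, scale and shift are illustrative values, not taken from the paper.
    """
    # Assumed scoring: shifted, scaled cosine similarity, so that dissimilar
    # residues receive a negative score, as local alignment requires.
    sub = scale * (cosine_similarity_matrix(emb_a, emb_b) - shift)
    n, m = sub.shape
    H = np.zeros((n + 1, m + 1))               # best local score ending at (i, j)
    ptr = np.zeros((n + 1, m + 1), dtype=int)  # 0 stop, 1 diagonal, 2 up, 3 left

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = (
                0.0,                                  # start a new local alignment
                H[i - 1, j - 1] + sub[i - 1, j - 1],  # align residue i with residue j
                H[i - 1, j] + gap,                    # gap in the second sequence
                H[i, j - 1] + gap,                    # gap in the first sequence
            )
            ptr[i, j] = int(np.argmax(candidates))
            H[i, j] = candidates[ptr[i, j]]

    # Trace back from the highest-scoring cell until a "stop" pointer is reached.
    i, j = np.unravel_index(np.argmax(H), H.shape)
    best = float(H[i, j])
    pairs = []
    while ptr[i, j] != 0:
        if ptr[i, j] == 1:
            pairs.append((int(i) - 1, int(j) - 1))
            i, j = i - 1, j - 1
        elif ptr[i, j] == 2:
            i -= 1
        else:
            j -= 1
    return best, pairs[::-1]


# Toy stand-ins for per-residue embeddings: each "residue" is a one-hot vector.
basis = np.eye(4)
emb_a = basis[[3, 0, 1, 2]]
emb_b = basis[[0, 1, 2, 3]]
score, pairs = local_align(emb_a, emb_b)
print(score, pairs)
```

With these toy inputs the sketch aligns positions 1 to 3 of the first sequence with positions 0 to 2 of the second. In a real setting the two input arrays would hold one embedding vector per residue from a protein language model, and the gap and scaling parameters would need tuning against a benchmark.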

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
