BIO Web of Conferences (Jan 2025)

ClusterEmbed: Lightweight Protein Structure Prediction on PCs

  • Yuan Chuxin

DOI
https://doi.org/10.1051/bioconf/202518202014
Journal volume & issue
Vol. 182
p. 02014

Abstract


Biological sequence design seeks to generate novel sequences, such as proteins, with optimized functional properties, a task complicated by vast combinatorial spaces and complex sequence-function relationships. Traditional offline methods limit adaptability and long-term performance. This paper introduces an online learning approach that integrates pre-trained language models (LMs), such as ESM-2, with gradient-based search to dynamically refine a proxy model during optimization. By updating the proxy in real time, the method addresses the static constraints of prior work and achieves significant improvements: 29% faster convergence (600 vs. 850 steps), higher proxy accuracy (MSE 1.78 vs. 2.15), and higher sequence quality (fitness 78.9 vs. 72.3), while maintaining diversity (15.7 vs. 15.4). We systematically evaluate four key variables (learning rate, update frequency, initial dataset size, and LM type) and demonstrate their impact on performance across eight experiments, including long-term optimization up to 10,000 steps (fitness 82.5). The framework's novelty lies in its hybrid design, which combines online learning with a bi-level structure, a fusion underrepresented in the literature. Its scalability and adaptability offer practical advantages for protein engineering and synthetic biology, where iterative refinement is essential.
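
The loop described in the abstract (search guided by a proxy, with the proxy refined online as new measurements arrive) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a toy 4-letter alphabet, a hypothetical `oracle_fitness` that counts matches to a hidden target, and a linear one-hot proxy in place of ESM-2 embeddings. For such a linear proxy, the gradient with respect to the one-hot input is the weight table itself, so the guided step simply picks the substitution with the largest predicted gain. The names `optimize`, `update_every`, `lr`, and `init_size` are illustrative stand-ins for the update frequency, learning rate, and initial dataset size discussed above.

```python
import random

ALPHABET = "ACDE"          # toy alphabet (assumption; real proteins use 20 residues)
TARGET = "ACDEACDEACDE"    # hidden optimum of the toy oracle (assumption)


def oracle_fitness(seq):
    """Hypothetical stand-in for an expensive fitness measurement."""
    return float(sum(a == b for a, b in zip(seq, TARGET)))


def proxy_predict(weights, seq):
    """Linear proxy on one-hot features: sum of per-position residue weights."""
    return sum(weights[i][ALPHABET.index(a)] for i, a in enumerate(seq))


def sgd_update(weights, data, lr, epochs=5):
    """Refine the proxy in place with SGD on squared error."""
    for _ in range(epochs):
        for seq, y in data:
            err = proxy_predict(weights, seq) - y
            for i, a in enumerate(seq):
                weights[i][ALPHABET.index(a)] -= lr * err


def guided_mutation(weights, seq, rng, explore=0.1):
    """Apply the single substitution with the largest proxy-predicted gain."""
    if rng.random() < explore:  # occasional random move for exploration
        i, k = rng.randrange(len(seq)), rng.randrange(len(ALPHABET))
    else:
        candidates = [
            (i, k)
            for i in range(len(seq))
            for k in range(len(ALPHABET))
            if ALPHABET[k] != seq[i]
        ]
        i, k = max(
            candidates,
            key=lambda ik: weights[ik[0]][ik[1]]
            - weights[ik[0]][ALPHABET.index(seq[ik[0]])],
        )
    return seq[:i] + ALPHABET[k] + seq[i + 1:]


def optimize(steps=200, update_every=5, lr=0.05, init_size=8, seed=0):
    """Proxy-guided search with periodic online proxy updates."""
    rng = random.Random(seed)
    n = len(TARGET)
    weights = [[0.0] * len(ALPHABET) for _ in range(n)]

    # Initial labelled dataset used to fit the first proxy.
    data = []
    for _ in range(init_size):
        s = "".join(rng.choice(ALPHABET) for _ in range(n))
        data.append((s, oracle_fitness(s)))
    sgd_update(weights, data, lr)

    best_seq, best_fit = max(data, key=lambda d: d[1])
    init_best = best_fit
    seq = best_seq
    for step in range(1, steps + 1):
        seq = guided_mutation(weights, seq, rng)   # search under the current proxy
        if step % update_every == 0:               # online update of the proxy
            y = oracle_fitness(seq)
            data.append((seq, y))
            sgd_update(weights, data, lr)
            if y > best_fit:
                best_seq, best_fit = seq, y
    return best_seq, best_fit, init_best


if __name__ == "__main__":
    seq, fit, start = optimize()
    print(f"best sequence {seq} with fitness {fit} (initial best {start})")
```

In this sketch, setting `update_every` larger than `steps` freezes the proxy after the initial fit, which gives a rough offline baseline to compare against the online variant.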