Enhancing efficiency of protein language models with minimal wet-lab data through few-shot learning

Ziyi Zhou; Liang Zhang; Yuanxi Yu; Banghao Wu; Mingchen Li; Liang Hong; Pan Tan

doi:10.1038/s41467-024-49798-6

Nature Communications (Jul 2024)

Affiliations

Ziyi Zhou: School of Physics and Astronomy, Shanghai Jiao Tong University
Liang Zhang: School of Physics and Astronomy, Shanghai Jiao Tong University
Yuanxi Yu: School of Physics and Astronomy, Shanghai Jiao Tong University
Banghao Wu: School of Life Sciences and Biotechnology, Shanghai Jiao Tong University
Mingchen Li: Shanghai Artificial Intelligence Laboratory
Liang Hong: School of Physics and Astronomy, Shanghai Jiao Tong University
Pan Tan: School of Physics and Astronomy, Shanghai Jiao Tong University

DOI: https://doi.org/10.1038/s41467-024-49798-6
Journal volume & issue: Vol. 15, no. 1
pp. 1 – 13

Abstract

Accurately modeling the protein fitness landscapes holds great importance for protein engineering. Pre-trained protein language models have achieved state-of-the-art performance in predicting protein fitness without wet-lab experimental data, but their accuracy and interpretability remain limited. On the other hand, traditional supervised deep learning models require abundant labeled training examples for performance improvements, posing a practical barrier. In this work, we introduce FSFP, a training strategy that can effectively optimize protein language models under extreme data scarcity for fitness prediction. By combining meta-transfer learning, learning to rank, and parameter-efficient fine-tuning, FSFP can significantly boost the performance of various protein language models using merely tens of labeled single-site mutants from the target protein. In silico benchmarks across 87 deep mutational scanning datasets demonstrate FSFP’s superiority over both unsupervised and supervised baselines. Furthermore, we successfully apply FSFP to engineer the Phi29 DNA polymerase through wet-lab experiments, achieving a 25% increase in the positive rate. These results underscore the potential of our approach in aiding AI-guided protein engineering.

Published in Nature Communications

ISSN: 2041-1723 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/ncomms/