Viruses (Mar 2024)
Predicting Natural Evolution in the RBD Region of the Spike Glycoprotein of SARS-CoV-2 by Machine Learning
Abstract
Machine learning (ML) is increasingly used to predict protein mutations and to support directed evolution, and research on potential virus variants is crucial for vaccine development. In this study, the machine learning software PyPEF was used to analyze mutations within the receptor-binding domain (RBD) of the Spike glycoprotein of SARS-CoV-2. Over 48,960,000 variants were predicted. Eight prospective variants that could emerge in the future were modeled and subjected to molecular dynamics simulations. The study forecasts that the latest variant, ISOY2P5O1, may emerge around 17 November 2023, with an uncertainty window of approximately ±22 days. The ISOY8P5O2 variant displayed an increased binding capacity in the dry (in silico) assay, with a total predicted binding energy of −110.306 kcal/mol. This represents an 8.25% enhancement in total binding energy compared to the original SARS-CoV-2 strain discovered in Wuhan (−101.892 kcal/mol). A reverse analysis using the ML models confirmed the structural significance of the mutation sites, particularly in the context of protein folding. The regression methods (SVR, RF, and PLS) were validated with different data structures. The study also compares the effectiveness of the “ML-Guided Design Correctly Predicts Combinatorial Effects Strategy” with that of the “ML-Guided Design Correctly Predicts Natural Evolution Prediction Strategy”. To support the machine learning workflow, we created a timestamping algorithm and two auxiliary programs that rapidly process extensive data, surpassing batch sequencing capabilities. This study not only advances the use of machine learning in guiding protein evolution but also holds potential for forecasting future viruses and for vaccine development.
Keywords