Results in Engineering (Dec 2024)
Ensemble deep learning model for protein secondary structure prediction using NLP metrics and explainable AI
Abstract
Proteins are essential components of all living organisms. They consist of chains of amino acids and carry out biological functions such as DNA replication, metabolism, and forming cell structure. Determining protein structures accurately is fundamental to understanding their functions. Advances in deep learning (DL) and artificial intelligence (AI) have transformed computational biology: AlphaFold and RoseTTAFold, for example, greatly outperform traditional methods and predict 3D protein structures with very high accuracy. This study builds on that progress, using an ensemble deep learning model for protein secondary structure prediction. In training and validation, the ensemble model outperformed the individual models, achieving the best accuracy (94.41%) and the lowest validation loss (0.1585). The ROUGE-L score, which is widely used in natural language processing (NLP), works well for evaluating protein sequences because it maintains their structural integrity, bringing NLP evaluation methods into bioinformatics. Explainable AI approaches such as integrated gradients and LIME made the model predictions more interpretable, shedding light on the features that affect protein shape. This approach reduces bias and provides a deeper understanding of the biological relationships between protein sequences and protein shapes.
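As an illustration of how ROUGE-L can be applied to secondary structure strings, the following is a minimal sketch and not the implementation used in this study. It assumes each residue label (for example H, E, or C) is treated as one token and that precision and recall are weighted equally (beta = 1). The function names `lcs_length` and `rouge_l` are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two label strings."""
    # Row-by-row dynamic programming; prev holds the previous row of the table.
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, prediction: str, beta: float = 1.0) -> float:
    """ROUGE-L F-measure with per-residue labels as tokens (an assumption)."""
    if not reference or not prediction:
        return 0.0
    lcs = lcs_length(reference, prediction)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)      # share of the reference that is matched
    precision = lcs / len(prediction)  # share of the prediction that is matched
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    # Hypothetical 10-residue example: true labels vs. predicted labels.
    print(rouge_l("CCHHHHEEEC", "CCHHHEEEEC"))
```

Because the longest common subsequence must preserve the order of residues, a prediction that contains the right labels in the wrong order scores lower than one that misplaces a single boundary, which is the property that makes the metric a natural fit for sequence labels.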