TCR-H: explainable machine learning prediction of T-cell receptor epitope binding on unseen datasets

Rajitha Rajeshwar T.; Omar N. A. Demerdash; Jeremy C. Smith

doi:10.3389/fimmu.2024.1426173

Frontiers in Immunology (Aug 2024)

Affiliations

Rajitha Rajeshwar T.: UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, TN, United States
Rajitha Rajeshwar T.: Department of Biochemistry and Cellular and Molecular Biology, University of Tennessee, Knoxville, TN, United States
Rajitha Rajeshwar T.: Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States
Omar N. A. Demerdash: UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, TN, United States
Omar N. A. Demerdash: Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States
Jeremy C. Smith: UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, TN, United States
Jeremy C. Smith: Department of Biochemistry and Cellular and Molecular Biology, University of Tennessee, Knoxville, TN, United States
Jeremy C. Smith: Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States

Journal volume & issue: Vol. 15

Abstract

Artificial-intelligence and machine-learning (AI/ML) approaches to predicting T-cell receptor (TCR)-epitope specificity achieve high performance metrics on test datasets that include sequences also present in the training set, but fail to generalize to test sets consisting of epitopes and TCRs that are absent from the training set, i.e., are ‘unseen’ during training of the ML model. We present TCR-H, a supervised support vector machine (SVM) classification model that uses physicochemical features and is trained on the largest dataset available to date, using only experimentally validated non-binders as negative datapoints. TCR-H exhibits an area under the receiver operating characteristic curve (AUC of ROC) of 0.87 for epitope ‘hard splitting’ (i.e., on test sets with all epitopes unseen during ML training), 0.92 for TCR hard splitting, and 0.89 for ‘strict splitting’, in which neither the epitopes nor the TCRs in the test set are seen in the training data. Furthermore, we employ the SHAP (SHapley Additive exPlanations) eXplainable AI (XAI) method for post hoc interrogation to interpret the models trained with different hard splits, shedding light on the key physicochemical features driving model predictions. TCR-H thus represents a significant step towards general applicability and explainability of epitope:TCR specificity prediction.
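
The distinction between epitope hard splitting, TCR hard splitting, and strict splitting can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`hard_split`, `strict_split`), the tuple layout `(tcr, epitope, label)`, the placeholder identifiers, and the choice to enforce strictness by discarding overlapping training pairs are all assumptions made here for illustration.

```python
import random

def hard_split(pairs, key, test_fraction=0.2, seed=0):
    """Hold out whole groups so that no test-set group appears in training.

    pairs: list of (tcr, epitope, label) tuples (assumed layout).
    key:   0 for a TCR hard split, 1 for an epitope hard split.
    """
    groups = sorted({p[key] for p in pairs})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    held_out = set(groups[:n_test])
    train = [p for p in pairs if p[key] not in held_out]
    test = [p for p in pairs if p[key] in held_out]
    return train, test

def strict_split(pairs, test_fraction=0.2, seed=0):
    """Neither the epitopes nor the TCRs of the test set occur in training.

    One simple way to do this (an assumption, and it discards data):
    start from an epitope hard split, then drop every training pair
    whose TCR also appears in the test set.
    """
    train, test = hard_split(pairs, key=1, test_fraction=test_fraction, seed=seed)
    test_tcrs = {tcr for tcr, _, _ in test}
    train = [p for p in train if p[0] not in test_tcrs]
    return train, test

# Toy data with placeholder identifiers, not real sequences or labels.
toy_pairs = [
    ("TCR_A", "EP_1", 1), ("TCR_A", "EP_2", 0), ("TCR_B", "EP_1", 0),
    ("TCR_C", "EP_3", 1), ("TCR_D", "EP_2", 1), ("TCR_E", "EP_4", 0),
]

ep_train, ep_test = hard_split(toy_pairs, key=1, test_fraction=0.25)
st_train, st_test = strict_split(toy_pairs, test_fraction=0.25)
print(len(ep_train), len(ep_test), len(st_train), len(st_test))
```

In a random split, by contrast, the same TCR or epitope sequence can land on both sides, which is the situation the abstract identifies as inflating reported performance.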

Published in Frontiers in Immunology

ISSN: 1664-3224 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Medicine: Internal medicine: Specialties of internal medicine: Immunologic diseases. Allergy
Website: http://journal.frontiersin.org/journal/immunology
