Applying T-classifier, binary classifiers, upon high-throughput TCR sequencing output to identify cytomegalovirus exposure history

Kaiyue Zhou; Jiaxin Huo; Caixia Gao; Xu Wang; Pengfei Xu; Jiahuan Hou; Wenying Guo; Tao Sun; Lin Da

doi:10.1038/s41598-023-31013-z

Scientific Reports (Mar 2023)

Affiliations

Kaiyue Zhou: Department of Mathematics, School of Mathematical Sciences, Inner Mongolia University
Jiaxin Huo: Department of Mathematics, School of Mathematical Sciences, Inner Mongolia University
Caixia Gao: Department of Mathematics, School of Mathematical Sciences, Inner Mongolia University
Xu Wang: Department of Mathematics, School of Mathematical Sciences, Inner Mongolia University
Pengfei Xu: Hangzhou ImmuQuad Biotechnologies
Jiahuan Hou: Hangzhou ImmuQuad Biotechnologies
Wenying Guo: Hangzhou ImmuQuad Biotechnologies
Tao Sun: Hangzhou ImmuQuad Biotechnologies
Lin Da: Department of Mathematics, School of Mathematical Sciences, Inner Mongolia University

DOI: https://doi.org/10.1038/s41598-023-31013-z
Journal volume & issue: Vol. 13, no. 1
pp. 1 – 8

Abstract

With the continuing advance of information technology and computing speed, ever more medical data is being generated. Applying rapidly developing artificial intelligence techniques to these data in support of the medical industry is an active research topic. Cytomegalovirus (CMV) is a widespread virus with strict species specificity, and the infection rate among Chinese adults exceeds 95%. Detecting CMV is therefore important, because apart from a few patients with clinical symptoms, the vast majority of infected people carry an inapparent infection. In this study, we present a new method to detect CMV infection status by analyzing high-throughput sequencing results of T cell receptor beta chains (TCRβ). Based on the high-throughput sequencing data of 640 subjects from cohort 1, Fisher's exact test was performed to evaluate the relationship between TCRβ sequences and CMV status. Subjects in cohort 1 and cohort 2 carrying these correlated sequences to different degrees were then counted to build binary classifier models that identify whether a subject is CMV positive or negative. We selected four binary classification algorithms for side-by-side comparison: logistic regression (LR), support vector machine (SVM), random forest (RF), and linear discriminant analysis (LDA). Comparing the performance of each algorithm across different Fisher's exact test thresholds yielded four optimal binary classification models. The LR algorithm performs best at a threshold of 10⁻⁵, with a sensitivity of 87.5% and a specificity of 96.88%. The RF algorithm also performs well at a threshold of 10⁻⁵, with a sensitivity of 87.5% and a specificity of 90.63%. The SVM algorithm likewise achieves high accuracy at a threshold of 10⁻⁵, with a sensitivity of 85.42% and a specificity of 96.88%. The LDA algorithm achieves high accuracy at a threshold of 10⁻⁴, with a sensitivity of 95.83% and a specificity of 90.63%. This is probably because the two-dimensional distribution of the CMV data samples is linearly separable, so linear models such as LDA are more effective, while nonlinear algorithms such as random forest separate the classes less accurately. This approach may be a potential diagnostic method for CMV and may even be applicable to other viruses, for example to detect a history of infection with the novel coronavirus.
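
The two quantitative steps named in the abstract, screening TCRβ sequences with Fisher's exact test and scoring a classifier by sensitivity and specificity, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The one-sided form of the test, the function names, and the input layout (a mapping from each sequence to the set of subjects carrying it) are all assumptions made for illustration; the abstract specifies only that Fisher's exact test was used with thresholds such as 10⁻⁵ and 10⁻⁴.

```python
from math import comb


def fisher_exact_one_sided(pos_with, pos_total, neg_with, neg_total):
    """One-sided Fisher's exact test p-value (assumed variant).

    Returns the hypergeometric probability of seeing at least `pos_with`
    carriers among the CMV-positive subjects, given the total number of
    carriers in both groups.
    """
    carriers = pos_with + neg_with
    n = pos_total + neg_total
    upper = min(carriers, pos_total)
    tail = sum(
        comb(carriers, k) * comb(n - carriers, pos_total - k)
        for k in range(pos_with, upper + 1)
    )
    return tail / comb(n, pos_total)


def select_associated(carriers_by_seq, cmv_status, threshold=1e-5):
    """Keep sequences whose p-value falls below the chosen threshold.

    carriers_by_seq: dict mapping sequence -> set of subject ids (hypothetical layout)
    cmv_status:      dict mapping subject id -> True if CMV positive
    """
    positives = {s for s, is_pos in cmv_status.items() if is_pos}
    neg_total = len(cmv_status) - len(positives)
    selected = []
    for seq, subjects in carriers_by_seq.items():
        pos_with = len(subjects & positives)
        neg_with = len(subjects) - pos_with
        p = fisher_exact_one_sided(pos_with, len(positives), neg_with, neg_total)
        if p < threshold:
            selected.append(seq)
    return selected


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical counts: 42 of 48 positives and 31 of 32 negatives correct.
    # The abstract does not state the test-set sizes; these are illustrative only.
    y_true = [1] * 48 + [0] * 32
    y_pred = [1] * 42 + [0] * 6 + [0] * 31 + [1] * 1
    sens, spec = sensitivity_specificity(y_true, y_pred)
    print(f"sensitivity={sens:.2%}, specificity={spec:.2%}")
```

The counts in the usage example are hypothetical and were chosen only because they produce percentages of the same form as those reported for logistic regression. In practice a library routine such as `scipy.stats.fisher_exact` would usually replace the hand-written test.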

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
