Scientific Reports (Mar 2024)

Clustering analysis for the evolutionary relationships of SARS-CoV-2 strains

  • Xiangzhong Chen,
  • Mingzhao Wang,
  • Xinglin Liu,
  • Wenjie Zhang,
  • Huan Yan,
  • Xiang Lan,
  • Yandi Xu,
  • Sanyi Tang,
  • Juanying Xie

DOI
https://doi.org/10.1038/s41598-024-57001-5
Journal volume & issue
Vol. 14, no. 1
pp. 1 – 14

Abstract

Read online

Abstract To explore the differences and relationships between the available SARS-CoV-2 strains and predict the potential evolutionary direction of these strains, we employ the hierarchical clustering analysis to investigate the evolutionary relationships between the SARS-CoV-2 strains utilizing the genomic sequences collected in China till January 7, 2023. We encode the sequences of the existing SARS-CoV-2 strains into numerical data through k-mer algorithm, then propose four methods to select the representative sample from each type of strains to comprise the dataset for clustering analysis. Three hierarchical clustering algorithms named Ward-Euclidean, Ward-Jaccard, and Average-Euclidean are introduced through combing the Euclidean and Jaccard distance with the Ward and Average linkage clustering algorithms embedded in the OriginPro software. Experimental results reveal that BF.28, BE.1.1.1, BA.5.3, and BA.5.6.4 strains exhibit distinct characteristics which are not observed in other types of SARS-CoV-2 strains, suggesting their being the majority potential sources which the future SARS-CoV-2 strains’ evolution from. Moreover, BA.2.75, CH.1.1, BA.2, BA.5.1.3, BF.7, and B.1.1.214 strains demonstrate enhanced abilities in terms of immune evasion, transmissibility, and pathogenicity. Hence, closely monitoring the evolutionary trends of these strains is crucial to mitigate their impact on public health and society as far as possible.