Effective primer design for genotype and subtype detection of highly divergent viruses in large scale genome datasets

Burak Demiralay; Tolga Can

doi:10.1186/s12859-025-06251-9

BMC Bioinformatics (Sep 2025)

Affiliations

Burak Demiralay: Department of Health Informatics, Informatics Institute, Middle East Technical University
Tolga Can: Department of Computer Science, Colorado School of Mines

DOI: https://doi.org/10.1186/s12859-025-06251-9
Journal volume & issue: Vol. 26, no. 1
pp. 1 – 19

Abstract

Identification of microorganisms in a biological sample is a crucial step in diagnostics, pathogen screening, biomedical research, evolutionary studies, agriculture, and biological threat assessment. While progress has been made in studying larger organisms, there is a need for an efficient and scalable method that can handle thousands of whole genomes for organisms with high mutation rates and genetic diversity, such as single-stranded viruses. In this study, we developed a novel method to identify subsequences for detection of a given species/subspecies in a (meta)genomic sample using the Polymerase Chain Reaction (PCR) method. Species detection in any analysis depends heavily on the measurement method, and since thermodynamic interactions are critical in PCR, thermodynamics is the main driving force of the proposed methodology. Our method is parallelized in multiple steps and involves extracting all oligonucleotides from target genomes. We then locate the target sites for each oligonucleotide using the constructed suffix array and local alignment, followed by thermodynamic interaction assessment. An important requirement for subspecies identification is to avoid amplifying a non-target set of genomes, and our method addresses this. We applied our method to three highly divergent viruses: (1) Hepatitis C virus (HCV), where the subtypes differ in 31–33% of nucleotide sites on average, (2) Human immunodeficiency virus (HIV), for which 25–35% between-subtype and 15–20% within-subtype variation is observed, and (3) the Dengue virus, whose respective genomes (only DENV 1–4) share 60% sequence identity with each other. Using our method, we were able to select oligonucleotides that can identify in silico 99.9% of 1657 HCV genomes, 99.7% of 11,838 HIV genomes, and 95.4% of 4016 Dengue genomes. We also show subspecies identification on genotypes 1–6 of HCV and genotypes 1–4 of the Dengue virus with a true positive rate above 99.5% and a false positive rate below 0.05%, on average. None of the state-of-the-art methods can produce oligonucleotides with this specificity and sensitivity on highly divergent viral genomes like the ones studied in this article.
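
The selection idea described in the abstract (extract every oligonucleotide from the target genomes, find where each one occurs, and reject candidates that also hit non-target genomes) can be illustrated with a small Python sketch. This is not the authors' implementation: exact substring matching stands in for the suffix array, local alignment, and thermodynamic assessment, and the function names `extract_oligos`, `coverage`, and `rank_oligos`, the `max_fp` parameter, and the toy sequences are all hypothetical.

```python
def extract_oligos(genome, k):
    """Return the set of all length-k substrings (candidate oligos) of a genome."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def coverage(oligo, genomes):
    """Fraction of genomes containing the oligo as an exact substring.

    Assumption: exact matching is a stand-in for the suffix-array lookup,
    local alignment, and thermodynamic check described in the abstract.
    """
    if not genomes:
        return 0.0
    return sum(oligo in g for g in genomes) / len(genomes)


def rank_oligos(targets, non_targets, k, max_fp=0.0):
    """Rank candidate oligos by target coverage, dropping non-specific ones.

    An oligo is kept only if the fraction of non-target genomes it matches
    is at most max_fp (a hypothetical specificity threshold).
    """
    candidates = set()
    for genome in targets:
        candidates |= extract_oligos(genome, k)

    scored = []
    for oligo in candidates:
        if coverage(oligo, non_targets) <= max_fp:
            scored.append((coverage(oligo, targets), oligo))

    # Highest target coverage first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


# Toy sequences invented for illustration only.
targets = ["ACGTACGTTT", "GGACGTACAA", "TTACGTACGG"]
non_targets = ["CCCCGTACCC"]
best_coverage, best_oligo = rank_oligos(targets, non_targets, k=6)[0]
print(best_oligo, best_coverage)
```

A realistic pipeline would additionally tolerate mismatches and score each oligo-site duplex thermodynamically, which is where the method described above departs from this simplification.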

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
