mSystems (Jul 2024)
Average nucleotide identity-based Staphylococcus aureus strain grouping allows identification of strain-specific genes in the pangenome
Abstract
Staphylococcus aureus causes both hospital- and community-acquired infections in humans worldwide. Due to the high incidence of infection, S. aureus is also one of the most sampled and sequenced pathogens today, providing an outstanding resource to understand variation at the bacterial subspecies level. We processed and downsampled 83,383 public S. aureus Illumina whole-genome shotgun sequences and 1,263 complete genomes to produce 7,954 representative substrains. Pairwise comparison of average nucleotide identity revealed a natural boundary of 99.5% that could be used to define 145 distinct strains within the species. We found that intermediate frequency genes in the pangenome (present in 10%–95% of genomes) could be divided into those closely linked to strain background (“strain-concentrated”) and those highly variable within strains (“strain-diffuse”). Non-core genes had different patterns of chromosome location. Notably, strain-diffuse genes were associated with prophages; strain-concentrated genes were associated with the vSaβ genome island and rare genes ([…] 84,000 genomes and subsampled to remove redundancy. We found that individual samples sharing >99.5% of their genome could be grouped into strains. We also showed that a portion of genes that are present in intermediate frequency in the species are strongly associated with some strains but completely absent from others, suggesting a role in strain specificity. This work lays the foundation for understanding individual gene histories of the S. aureus species and also outlines strategies for processing large bacterial genomic data sets.
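The two thresholds named in the abstract (the 99.5% average nucleotide identity boundary and the 10%–95% intermediate-frequency band) can be illustrated with a minimal sketch. This is not the authors' pipeline: the single-linkage grouping rule, the inclusive handling of the boundary values, the "core" label for genes above 95%, and the function names `group_into_strains` and `frequency_class` are all assumptions made for illustration, and the ANI values in the usage note are invented toy numbers.

```python
from itertools import combinations

ANI_THRESHOLD = 99.5  # percent; the strain boundary reported in the abstract


def group_into_strains(genomes, ani):
    """Group genomes connected by pairwise ANI at or above the threshold.

    Assumption of this sketch: single-linkage grouping (connected
    components via union-find). The abstract does not state the rule used.
    `ani` maps unordered genome pairs (as tuples) to percent identity.
    """
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the root, compressing the path on the way.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in combinations(genomes, 2):
        value = ani.get((a, b), ani.get((b, a)))
        if value is not None and value >= ANI_THRESHOLD:
            parent[find(a)] = find(b)

    strains = {}
    for g in genomes:
        strains.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in strains.values())


def frequency_class(n_present, n_genomes):
    """Classify a gene by the fraction of genomes that carry it.

    Intermediate is 10%-95% as in the abstract; treating both bounds as
    inclusive, and calling genes above 95% "core", are assumptions.
    """
    fraction = n_present / n_genomes
    if fraction > 0.95:
        return "core"
    if fraction >= 0.10:
        return "intermediate"
    return "rare"
```

With toy values such as `{("g1", "g2"): 99.8, ("g2", "g3"): 99.6, ("g1", "g3"): 99.4}`, single linkage places `g1`, `g2`, and `g3` in one group even though `g1` and `g3` fall just below the threshold; a stricter rule (for example, complete linkage) would split them, which is why the linkage choice is flagged as an assumption.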
Keywords