KmerAperture: Retaining k-mer synteny for alignment-free extraction of core and accessory differences between bacterial genomes.

Matthew P Moore; Mirjam Laager; Paolo Ribeca; Xavier Didelot

doi:10.1371/journal.pgen.1011184

PLoS Genetics (Apr 2024)

DOI: https://doi.org/10.1371/journal.pgen.1011184
Journal volume & issue: Vol. 20, no. 4
p. e1011184

Abstract

By decomposing genome sequences into k-mers, it is possible to estimate genome differences without alignment. Techniques such as k-mer minimisers, for example MinHash, have been developed and are often accurate approximations of distances based on full k-mer sets. These and other alignment-free methods avoid the large time and computational expense of alignment. However, these k-mer set comparisons are not entirely accurate within species and can be completely inaccurate within a lineage. This is due, in part, to their inability to distinguish core polymorphism from accessory differences. Here we present a new approach, KmerAperture, which uses information on the relative genomic positions of k-mers to determine the type of polymorphism causing differences in k-mer presence and absence between pairs of genomes. A single SNP is expected to result in k unique contiguous k-mers per genome. On the other hand, a contiguous series of S > k unique k-mers may be caused by an accessory difference of length S-k+1, when the start and end of the sequence are contiguous with homologous sequence. Alternatively, it may be caused by multiple SNPs within k bp of each other, and KmerAperture can determine whether that is the case. To demonstrate use cases, KmerAperture was benchmarked using datasets including a very low diversity simulated population with accessory content independent of the number of SNPs, a simulated population where SNPs are spatially dense, a moderately diverse real cluster of genomes (Escherichia coli ST1193) with a large accessory genome, and a low diversity real genome cluster (Salmonella Typhimurium ST34). We show that KmerAperture can accurately distinguish both core and accessory sequence diversity without alignment, outperforming other k-mer based tools.
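
The run-length arithmetic in the abstract can be checked on toy data. The sketch below is not the KmerAperture implementation; it is a minimal illustration under stated assumptions: forward-strand k-mers only (no reverse complements), a random 400 bp reference, an arbitrary k of 15, and hypothetical helper names (`other_base`, `substitute`, `unique_run_lengths`). It counts contiguous runs of k-mers in a query that are absent from the reference, for a single SNP, for two SNPs 6 bp apart, and for a 30 bp insertion flanked by homologous sequence.

```python
import random

K = 15  # illustrative k-mer size, not a recommended setting


def other_base(b):
    """Return a base guaranteed to differ from b."""
    return "ACGT"[("ACGT".index(b) + 1) % 4]


def substitute(seq, pos):
    """Introduce a SNP at pos."""
    return seq[:pos] + other_base(seq[pos]) + seq[pos + 1:]


def unique_run_lengths(query, other, k=K):
    """Lengths of contiguous runs of k-mers in query that are absent from other."""
    other_kmers = {other[i:i + k] for i in range(len(other) - k + 1)}
    runs, current = [], 0
    for i in range(len(query) - k + 1):
        if query[i:i + k] in other_kmers:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


rng = random.Random(0)
ref = "".join(rng.choice("ACGT") for _ in range(400))

# One SNP, and two SNPs closer together than k.
snp = substitute(ref, 200)
two_snps = substitute(substitute(ref, 200), 206)

# A 30 bp insertion. Its boundary bases are forced to differ from the
# flanking reference bases; otherwise edge k-mers can still match the
# reference and the observed run is shorter than L + k - 1.
L = 30
ins = [rng.choice("ACGT") for _ in range(L)]
ins[0] = other_base(ref[200])
ins[-1] = other_base(ref[199])
insertion = ref[:200] + "".join(ins) + ref[200:]

snp_runs = unique_run_lengths(snp, ref)
two_snp_runs = unique_run_lengths(two_snps, ref)
ins_runs = unique_run_lengths(insertion, ref)
print(snp_runs, two_snp_runs, ins_runs)
```

Under these assumptions the single SNP gives one run of k k-mers, the two nearby SNPs give one run of k + 6, and the insertion gives one run of S = L + k - 1, so L = S - k + 1. The second and third cases both exceed k, which is why run length alone cannot separate dense SNPs from accessory sequence.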

Published in PLoS Genetics

ISSN: 1553-7390 (Print); 1553-7404 (Online)
Publisher: Public Library of Science (PLoS)
Country of publisher: United States
LCC subjects: Science: Biology (General): Genetics
Website: https://journals.plos.org/plosgenetics/