Genes (Feb 2020)

Whole-Genome k-mer Topic Modeling Associates Bacterial Families

  • Ernesto Borrayo,
  • Isaias May-Canche,
  • Omar Paredes,
  • J. Alejandro Morales,
  • Rebeca Romo-Vázquez,
  • Hugo Vélez-Pérez

DOI
https://doi.org/10.3390/genes11020197
Journal volume & issue
Vol. 11, no. 2
p. 197

Abstract

Applying alignment-free k-mer-based algorithms to whole-genome sequence comparison remains an ongoing challenge. Here, we explore the possibility of using topic modeling for whole-genome comparisons between organisms. We analyzed 30 complete genomes from three bacterial families by topic modeling. For this, each genome was treated as a document and its 13-mer nucleotide representations as words. Latent Dirichlet allocation was used for probabilistic modeling of the corpus. We were able to identify the topic distribution among the analyzed genomes, which is highly consistent with traditional hierarchical classification. Topic modeling may therefore be applicable to establishing relationships between genome composition and biological phenomena.
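
The genome-as-document, k-mer-as-word representation described above can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `kmer_counts` and `build_corpus`, the use of overlapping windows, and the choice to skip windows containing non-ACGT symbols are all assumptions made for illustration. The abstract does not say how ambiguous bases or reverse complements were handled.

```python
from collections import Counter


def kmer_counts(sequence, k=13):
    """Count overlapping k-mers in one nucleotide sequence.

    Assumption (not from the paper): windows containing any symbol
    other than A, C, G, or T are skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def build_corpus(genomes, k=13):
    """Turn {genome_name: sequence} into a vocabulary and a document-term matrix.

    Each genome is one "document"; each distinct k-mer is one "word".
    Rows follow the insertion order of `genomes`, columns follow `vocabulary`.
    """
    per_genome = {name: kmer_counts(seq, k) for name, seq in genomes.items()}
    vocabulary = sorted(set().union(*per_genome.values()))
    matrix = [[per_genome[name][kmer] for kmer in vocabulary] for name in genomes]
    return vocabulary, matrix


# Toy sequences (hypothetical, far shorter than real genomes), with k=4 for readability.
toy_genomes = {"toy_a": "ACGTACGT", "toy_b": "ACGTTTTT"}
vocabulary, matrix = build_corpus(toy_genomes, k=4)
```

The resulting count matrix is the kind of input a latent Dirichlet allocation implementation would take, with one row per genome. With k = 13 there are up to 4^13 (about 67 million) possible words, so a real analysis would likely need a sparse representation rather than the dense lists used here.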

Keywords