KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-Free Phylogenomic Analysis

Natapol Pornputtapong; Daniel A. Acheampong; Preecha Patumcharoenpol; Piroon Jenjaroenpun; Thidathip Wongsurawat; Se-Ran Jun; Suganya Yongkiettrakul; Nipa Chokesajjawatee; Intawat Nookaew

doi:10.3389/fbioe.2020.556413

Frontiers in Bioengineering and Biotechnology (Sep 2020)

Affiliations

Natapol Pornputtapong: Department of Biochemistry and Microbiology, Faculty of Pharmaceutical Sciences, and Research Unit of DNA Barcoding of Thai Medicinal Plants, Chulalongkorn University, Bangkok, Thailand
Daniel A. Acheampong: Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States; Joint Graduate Program in Bioinformatics, University of Arkansas at Little Rock and University of Arkansas for Medical Sciences, Little Rock, AR, United States
Preecha Patumcharoenpol: Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
Piroon Jenjaroenpun: Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
Thidathip Wongsurawat: Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
Se-Ran Jun: Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
Suganya Yongkiettrakul: National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency, Pathum Thani, Thailand
Nipa Chokesajjawatee: National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency, Pathum Thani, Thailand
Intawat Nookaew: Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States

Journal volume & issue: Vol. 8

Abstract

Genomic DNA is the best “unique identifier” for organisms. Alignment-free phylogenomic analysis, a simple, fast, and efficient method for comparing genome sequences, relies on the distribution of short DNA sequences of a particular length, referred to as k-mers. The k-mer approach has been explored as a basis for sequence analysis applications, including assembly, phylogenetic tree inference, and classification. Although this approach is not novel, the k-mer length is usually chosen rather arbitrarily, even though it is a critical parameter for obtaining genome/sequence distances with enough resolution to infer biologically meaningful phylogenetic relationships. Thus, there is a need for a systematic approach to identify the appropriate k-mer length from whole-genome sequences. We present K-mer–length Iterative Selection for UNbiased Ecophylogenomics (KITSUNE), a tool for assessing the empirically optimal k-mer length of any given set of genomes of interest for phylogenomic analysis via a three-step approach based on (1) cumulative relative entropy (CRE), (2) average number of common features (ACF), and (3) observed common features (OCF). Using KITSUNE, we demonstrated the feasibility and reliability of these measurements to obtain empirically optimal k-mer lengths of 11, 17, and ∼34 from large genome datasets of viruses, bacteria, and fungi, respectively. Moreover, we demonstrated the use of KITSUNE for accurate species identification of two de novo assembled bacterial genomes derived from error-prone long-read sequences, and of a published yeast genome. In addition, KITSUNE was used to identify the shortest species-specific k-mer length that accurately identifies viruses. KITSUNE is freely available at https://github.com/natapol/kitsune.
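
To illustrate why the choice of k matters, the following is a minimal Python sketch that counts overlapping k-mers in two sequences and reports how many distinct k-mers they share at several values of k. It is an assumption-laden illustration only: the function names (`kmer_counts`, `shared_kmers`) and the toy sequences are hypothetical, and the shared-k-mer count is a simplified stand-in for the idea of common features, not KITSUNE's CRE, ACF, or OCF calculations.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq, skipping k-mers with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    return counts


def shared_kmers(seq_a, seq_b, k):
    """Number of distinct k-mers present in both sequences (hypothetical helper)."""
    return len(kmer_counts(seq_a, k).keys() & kmer_counts(seq_b, k).keys())


# Toy sequences, invented for illustration only.
genome_a = "ACGTACGTGACCTGA"
genome_b = "ACGTTCGTGACCAGA"

for k in (2, 3, 4, 5):
    print(k, shared_kmers(genome_a, genome_b, k))
```

With short k, nearly all k-mers occur in both sequences and carry little discriminating signal; with long k, few k-mers are shared at all. The empirically optimal length described in the abstract lies between these extremes.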

Published in Frontiers in Bioengineering and Biotechnology

ISSN: 2296-4185 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology: Biotechnology
Website: http://www.frontiersin.org/bioengineering_and_biotechnology
