Optimal choice of word length when comparing two Markov sequences using a χ 2-statistic

Xin Bai; Kujin Tang; Jie Ren; Michael Waterman; Fengzhu Sun

doi:10.1186/s12864-017-4020-z

BMC Genomics (Oct 2017)

Optimal choice of word length when comparing two Markov sequences using a χ 2-statistic

Xin Bai,
Kujin Tang,
Jie Ren,
Michael Waterman,
Fengzhu Sun

Affiliations

Xin Bai: Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University
Kujin Tang: Molecular and Computational Biology Program, University of Southern California
Jie Ren: Molecular and Computational Biology Program, University of Southern California
Michael Waterman: Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University
Fengzhu Sun: Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University

DOI: https://doi.org/10.1186/s12864-017-4020-z
Journal volume & issue: Vol. 18, no. S6
pp. 19 – 30

Abstract

Read online

Abstract Background Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. Genome sequences are frequently modelled by Markov chains and the likelihood ratio test or the corresponding approximate χ 2-statistic has been suggested to compare two sequences. However, it is not known how to best choose the word length k in such studies. Results We develop an optimal strategy to choose k by maximizing the statistical power of detecting differences between two sequences. Let the orders of the Markov chains for the two sequences be r 1 and r 2, respectively. We show through both simulations and theoretical studies that the optimal k= max(r 1,r 2)+1 for both long sequences and next generation sequencing (NGS) read data. The orders of the Markov chains may be unknown and several methods have been developed to estimate the orders of Markov chains based on both long sequences and NGS reads. We study the power loss of the statistics when the estimated orders are used. It is shown that the power loss is minimal for some of the estimators of the orders of Markov chains. Conclusion Our studies provide guidelines on choosing the optimal word length for the comparison of Markov sequences.

Published in BMC Genomics

ISSN: 1471-2164 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Technology: Chemical technology: Biotechnology; Science: Biology (General): Genetics
Website: http://bmcgenomics.biomedcentral.com

About the journal

Abstract

Keywords