BMC Genomics (Oct 2017)

Optimal choice of word length when comparing two Markov sequences using a χ 2-statistic

  • Xin Bai,
  • Kujin Tang,
  • Jie Ren,
  • Michael Waterman,
  • Fengzhu Sun

DOI
https://doi.org/10.1186/s12864-017-4020-z
Journal volume & issue
Vol. 18, no. S6
pp. 19 – 30

Abstract

Read online

Abstract Background Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. Genome sequences are frequently modelled by Markov chains and the likelihood ratio test or the corresponding approximate χ 2-statistic has been suggested to compare two sequences. However, it is not known how to best choose the word length k in such studies. Results We develop an optimal strategy to choose k by maximizing the statistical power of detecting differences between two sequences. Let the orders of the Markov chains for the two sequences be r 1 and r 2, respectively. We show through both simulations and theoretical studies that the optimal k= max(r 1,r 2)+1 for both long sequences and next generation sequencing (NGS) read data. The orders of the Markov chains may be unknown and several methods have been developed to estimate the orders of Markov chains based on both long sequences and NGS reads. We study the power loss of the statistics when the estimated orders are used. It is shown that the power loss is minimal for some of the estimators of the orders of Markov chains. Conclusion Our studies provide guidelines on choosing the optimal word length for the comparison of Markov sequences.

Keywords