IEEE Access (Jan 2021)
Is BCH Code Useful to DNA Classification as an Alignment-Free Method?
Abstract
Similarities between biological and digital communication systems have been investigated because biology also uses a discrete alphabet to represent and transmit information. The genetic information of an organism is encoded in DNA molecules by units called bases. However, there is no definitive model, and the question of which error-correcting code underlies DNA sequences remains an open problem. Recent works show that DNA sequences can be identified as codewords in a class of cyclic error-correcting codes known as BCH codes. We propose improvements to the code construction process that result in a novel algorithm for searching for BCH codes with a codeword that differs from a given DNA sequence (mapped to the finite field $\mathbb{F}_{4}$) in at most one symbol. The most important improvement is the replacement of brute-force decoding with syndrome decoding. Based on a statistical analysis, we verify whether, in a collection of sequences with the same taxonomic rank, there is a code that identifies most of these sequences, called the dominant code. Furthermore, we check whether the dominant code can provide biological information for DNA classification as an alignment-free method. Finally, we show that the probability of a DNA sequence of odd length $n$ being identified by a BCH code tends to the analytical probability of the same code identifying a random vector.
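The random-vector baseline in the final claim can be illustrated with a standard sphere-counting argument. The sketch below is not taken from the paper. It assumes a linear $(n, k)$ code over $\mathbb{F}_{q}$ with minimum distance at least 3, so that the radius-1 Hamming balls around codewords are disjoint, and it takes "identified" to mean "within Hamming distance 1 of some codeword". The function name `prob_within_one` is hypothetical.

```python
from fractions import Fraction


def prob_within_one(n: int, k: int, q: int = 4) -> Fraction:
    """Probability that a uniformly random vector in F_q^n lies within
    Hamming distance 1 of some codeword of an (n, k) linear code.

    Assumption (not from the source): the code has minimum distance >= 3,
    so the radius-1 balls around distinct codewords do not overlap.
    """
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    # Size of a radius-1 Hamming ball: the centre plus n positions,
    # each of which can change to q - 1 other symbols.
    ball_size = 1 + n * (q - 1)
    # q^k disjoint balls out of q^n vectors in total.
    return Fraction(q**k * ball_size, q**n)


if __name__ == "__main__":
    # Sanity check with a perfect code: the (5, 3) Hamming code over F_4
    # covers the whole space with radius-1 balls.
    print(prob_within_one(5, 3))
    # A hypothetical (15, 11) code over F_4 with minimum distance >= 3.
    print(prob_within_one(15, 11))
```

Under these assumptions the probability depends only on $n$, $k$, and $q$, which is why it can serve as a baseline against which the identification rate of real DNA sequences is compared.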
Keywords