Entropy (Mar 2015)

Analysis of Data Complexity in Human DNA for Gene-Containing Zone Prediction

  • Ricardo E. Monge
  • Juan L. Crespo

DOI
https://doi.org/10.3390/e17041673
Journal volume & issue
Vol. 17, no. 4
pp. 1673 – 1689

Abstract

This study extends the analysis of genomic data by computing a variety of complexity measures. We analyze the effect of window size and evaluate the precision and recall of gene-zone prediction on a much larger dataset (full chromosomes). A technique based on separating two cases (gene-containing and non-gene-containing) was developed as a basic gene predictor for automated DNA analysis. The predictor was tested on various human DNA sequences obtained from public databases, in a set of three experiments. The first covers window size and other parameters; the second analyzes a full human chromosome (198 million nucleotides); and the third tests subject variability across five individual subjects. All three experiments yield high recall and precision, indicating that the predictor is effective.
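
As a rough illustration of the workflow the abstract describes (windowed complexity, a two-class decision, then precision and recall), the following is a minimal Python sketch. It is not the authors' method: the choice of k-mer Shannon entropy as the complexity measure, the threshold rule, the decision direction (higher entropy labeled gene-containing), and the names `window_entropy`, `predict_windows`, and `precision_recall` are all assumptions made for illustration.

```python
import math
from collections import Counter


def window_entropy(seq, k=3):
    """Shannon entropy (bits) of the k-mer distribution in seq.

    Assumption: k-mer entropy stands in for the paper's complexity measures.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def predict_windows(seq, window, threshold, k=3):
    """Label each full, non-overlapping window as gene-containing or not.

    Assumption: a window is labeled gene-containing (True) when its entropy
    is at or above the threshold; the real decision rule may differ.
    """
    return [
        window_entropy(seq[start:start + window], k) >= threshold
        for start in range(0, len(seq) - window + 1, window)
    ]


def precision_recall(predicted, actual):
    """Precision and recall of boolean window labels against a reference."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy sequence: one low-complexity window followed by one varied window.
    toy = "AAAA" + "ACGT"
    labels = predict_windows(toy, window=4, threshold=1.0, k=1)
    print(labels, precision_recall(labels, [False, True]))
```

In a sketch like this, the window size and threshold are the parameters one would sweep, which corresponds loosely to the first experiment; the reference labels would come from an annotation source, which is not specified here.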

Keywords