PLoS Computational Biology (Mar 2006)

A third approach to gene prediction suggests thousands of additional human transcribed regions.

  • Gustavo Glusman,
  • Shizhen Qin,
  • M Raafat El-Gewely,
  • Andrew F Siegel,
  • Jared C Roach,
  • Leroy Hood,
  • Arian F A Smit

DOI
https://doi.org/10.1371/journal.pcbi.0020018
Journal volume & issue
Vol. 2, no. 3
p. e18

Abstract


The identification and characterization of the complete ensemble of genes is a main goal of deciphering the digital information stored in the human genome. Many algorithms for computational gene prediction have been described, ultimately derived from two basic concepts: (1) modeling gene structure and (2) recognizing sequence similarity. Successful hybrid methods combining these two concepts have also been developed. We present a third orthogonal approach to gene prediction, based on detecting the genomic signatures of transcription, accumulated over evolutionary time. We discuss four algorithms based on this third concept: Greens and CHOWDER, which quantify mutational strand biases caused by transcription-coupled DNA repair, and ROAST and PASTA, which are based on strand-specific selection against polyadenylation signals. We combined these algorithms into an integrated method called FEAST, which we used to predict the location and orientation of thousands of putative transcription units not overlapping known genes. Many of the newly predicted transcriptional units do not appear to code for proteins. The new algorithms are particularly well suited to detecting genes that have long introns and lack sequence conservation. They therefore complement existing gene prediction methods and will help identify functional transcripts within many apparent "genomic deserts."
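
The idea of strand-specific selection against polyadenylation signals can be illustrated with a minimal sketch. It is not the ROAST or PASTA algorithm, whose details the abstract does not give. It assumes the canonical AATAAA hexamer as the only signal and uses a simple normalized count difference as the score. The function names and the score definition are hypothetical choices made for illustration.

```python
# Illustrative sketch only: NOT the published ROAST/PASTA implementation.
# Assumption: the polyadenylation signal is modeled as the hexamer AATAAA.

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def polya_strand_bias(seq, motif="AATAAA"):
    """Hypothetical bias score in [-1, 1] for a genomic window.

    fwd counts the motif as read on the given (forward) strand;
    rev counts it as read on the opposite strand, i.e. occurrences
    of the motif's reverse complement in seq.
    A positive score means the signal is depleted on the forward
    strand relative to the reverse strand.
    """
    seq = seq.upper()
    fwd = count_overlapping(seq, motif)
    rev = count_overlapping(seq, reverse_complement(motif))
    total = fwd + rev
    if total == 0:
        return 0.0  # no signal on either strand: no evidence
    return (rev - fwd) / total
```

Under these assumptions, a window with a clearly positive score would be consistent with transcription from the forward strand, since signals that would prematurely terminate a transcript are expected to be selected against on the transcribed sense strand. A practical method would also need to correct for base composition and assess statistical significance, neither of which this sketch attempts.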