Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes

Corentin Meyer; Nicolas Scalzitti; Anne Jeannin-Girardon; Pierre Collet; Olivier Poch; Julie D. Thompson

doi:10.1186/s12859-020-03855-1

BMC Bioinformatics (Nov 2020)

Affiliations

Corentin Meyer: Department of Computer Science, ICube, CNRS, University of Strasbourg
Nicolas Scalzitti: Department of Computer Science, ICube, CNRS, University of Strasbourg
Anne Jeannin-Girardon: Department of Computer Science, ICube, CNRS, University of Strasbourg
Pierre Collet: Department of Computer Science, ICube, CNRS, University of Strasbourg
Olivier Poch: Department of Computer Science, ICube, CNRS, University of Strasbourg
Julie D. Thompson: Department of Computer Science, ICube, CNRS, University of Strasbourg

Journal volume & issue: Vol. 21, no. 1, pp. 1–16

Abstract

Background: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon–intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses.

Results: We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins.

Conclusions: Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon–intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.
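The three error categories named in the Results (deletions, insertions and mismatched segments) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the abstract does not describe their detection method, so the standard-library `difflib.SequenceMatcher` is used here as a stand-in for a proper protein alignment. The function name `classify_differences` and the toy sequences are hypothetical.

```python
from difflib import SequenceMatcher

# Map difflib opcode tags to the error categories named in the abstract.
# Assumption: the first sequence is the trusted reference (e.g. human) and
# the second is the predicted sequence being checked.
LABELS = {"delete": "deletion", "insert": "insertion", "replace": "mismatch"}


def classify_differences(reference, predicted):
    """Return (category, reference_segment, predicted_segment) tuples."""
    matcher = SequenceMatcher(None, reference, predicted, autojunk=False)
    events = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical stretch, nothing to report
        events.append((LABELS[tag], reference[i1:i2], predicted[j1:j2]))
    return events


# Toy sequences, invented for illustration only.
reference = "MKTAYIQRSFVH"
print(classify_differences(reference, "MKTAYSFVH"))       # segment missing
print(classify_differences(reference, "MKTAYIQRGGSFVH"))  # extra segment
print(classify_differences(reference, "MKTAYWLPSFVH"))    # segment replaced
```

A real analysis would need a substitution-aware alignment and length thresholds to separate genuine evolutionary divergence from prediction errors; the sketch only shows how the three categories differ.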

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
