Effect of the sequence data deluge on the performance of methods for detecting protein functional residues

Diego Garrido-Martín; Florencio Pazos

doi:10.1186/s12859-018-2084-7

BMC Bioinformatics (Feb 2018)


Affiliations

Diego Garrido-Martín: Present address: Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology
Florencio Pazos: Computational Systems Biology Group, Systems Biology Program, National Centre for Biotechnology (CNB-CSIC)

DOI: https://doi.org/10.1186/s12859-018-2084-7
Journal volume & issue: Vol. 19, no. 1, pp. 1–9

Abstract


Background: The exponential accumulation of new sequences in public databases is expected to improve the performance of all approaches for predicting protein structural and functional features. Nevertheless, this has never been assessed or quantified for some widely used methodologies, such as those aimed at detecting functional sites and functional subfamilies in protein multiple sequence alignments. Using raw protein sequences as their only input, these approaches can detect fully conserved positions, as well as those with a family-dependent conservation pattern. Both types of residues are routinely used as predictors of functional sites and, consequently, understanding how the sequence content of the databases affects them is relevant and timely.

Results: In this work we evaluate how the growth and change over time in the content of sequence databases affect five sequence-based approaches for detecting functional sites and subfamilies. We do so by recreating historical versions of the multiple sequence alignments that would have been obtained in the past, based on the database contents at different time points covering a period of 20 years. Applying the methods to these historical alignments allows us to quantify the temporal variation in their performance. Our results show that the number of families to which these methods can be applied increases sharply with time, while their ability to detect potentially functional residues remains almost constant.

Conclusions: These results are informative for the methods' developers and end users, and may have implications for the design of new sequencing initiatives.
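The two residue classes named in the abstract, fully conserved positions and positions with a family-dependent conservation pattern, can be illustrated with a minimal Python sketch. It is not one of the five methods evaluated in the paper, which the abstract does not name. It assumes a toy alignment and assumes the subfamily labels are already known, whereas the evaluated approaches may infer subfamilies themselves. The function name `classify_columns`, the sequence names, and the sequences are all hypothetical.

```python
from collections import defaultdict


def classify_columns(alignment, subfamily):
    """Classify alignment columns (toy illustration, not a published method).

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    subfamily: dict mapping sequence name -> subfamily label (assumed known).
    Returns (fully_conserved, family_dependent) as lists of 0-based columns.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")

    # Group sequence names by their subfamily label.
    groups = defaultdict(list)
    for n in names:
        groups[subfamily[n]].append(n)

    fully_conserved, family_dependent = [], []
    for col in range(length):
        column = {n: alignment[n][col] for n in names}
        if "-" in column.values():
            continue  # simplification: ignore columns containing gaps
        if len(set(column.values())) == 1:
            fully_conserved.append(col)  # same residue in every sequence
            continue
        # Set of residues observed in each subfamily at this column.
        per_group = [{column[n] for n in members} for members in groups.values()]
        conserved_within = all(len(s) == 1 for s in per_group)
        distinct_between = len({next(iter(s)) for s in per_group}) == len(per_group)
        if len(per_group) > 1 and conserved_within and distinct_between:
            family_dependent.append(col)  # conserved within, different between
    return fully_conserved, family_dependent


# Hypothetical toy data: two subfamilies (A and B) of two sequences each.
toy_alignment = {
    "seqA1": "MKHDLA",
    "seqA2": "MKHDIA",
    "seqB1": "MRHELA",
    "seqB2": "MRHEVA",
}
toy_subfamily = {"seqA1": "A", "seqA2": "A", "seqB1": "B", "seqB2": "B"}

conserved, specific = classify_columns(toy_alignment, toy_subfamily)
print("fully conserved columns:", conserved)
print("family-dependent columns:", specific)
```

In this toy alignment, columns 0, 2 and 5 carry the same residue in every sequence, while columns 1 and 3 are conserved within each subfamily but differ between them. The strict all-or-nothing criteria are chosen for readability. A practical method would typically score columns on a continuous scale and handle gaps explicitly.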

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
