Rapid identification of novel protein families using similarity searches [version 1; peer review: 2 approved]

Matt Jeffryes; Alex Bateman

doi:10.12688/f1000research.17315.1

F1000Research (Dec 2018)

Affiliations

Matt Jeffryes: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
Alex Bateman: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK

DOI: https://doi.org/10.12688/f1000research.17315.1
Journal volume & issue: Vol. 7

Abstract

Protein family databases are an important tool for biologists trying to dissect the function of proteins. Comparing potential new families to the thousands of existing entries is an important task when operating a protein family database. This comparison helps to determine whether a collection of protein regions forms a novel family or overlaps with existing families. In this paper, we describe a method for performing this analysis with an adjustable trade-off between accuracy and speed, enabling interactive comparisons. The method is based on the MinHash algorithm, which we have extended to calculate the Jaccard containment rather than the Jaccard index estimated by the original MinHash technique. Testing this method with the Pfam protein family database, we are able to compare potential new families to the more than 17,000 existing families in Pfam in less than a second, with little loss in accuracy.
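To make the distinction between the Jaccard index and the Jaccard containment concrete, the following is a minimal Python sketch of one generic way to estimate containment from MinHash signatures. It is not the authors' implementation: the function names, the k-mer length, the number of hash functions, and the use of known set sizes to convert an estimated Jaccard index into a containment are all assumptions made for illustration. The conversion relies on the identity |A ∩ B| = J(|A| + |B|) / (1 + J).

```python
import hashlib


def kmers(sequence, k=3):
    """Return the set of overlapping k-mers of a sequence (k is an assumed value)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def _hash(item, seed):
    """Seeded 64-bit hash of a string, built on BLAKE2b from the standard library."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """One minimum per seeded hash function; `items` must be non-empty."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def estimate_containment(sig_a, sig_b, size_a, size_b):
    """Estimate |A ∩ B| / |A| from the Jaccard estimate and the two set sizes."""
    j = estimate_jaccard(sig_a, sig_b)
    intersection = j * (size_a + size_b) / (1 + j)
    return min(1.0, intersection / size_a)


# Hypothetical example: a short candidate region that is a substring of a
# longer family member, so its k-mers are fully contained in the larger set.
candidate = kmers("MKVLAAGIVGLLLA")
family = kmers("MKVLAAGIVGLLLAQPSWDERTYHNCFG")
sig_candidate = minhash_signature(candidate)
sig_family = minhash_signature(family)
containment = estimate_containment(
    sig_candidate, sig_family, len(candidate), len(family)
)
```

In this sketch the number of hash functions plays the role of the speed and accuracy dial: fewer hashes give shorter signatures and faster comparisons at the cost of a noisier estimate. Containment is the more natural measure when a small candidate family is compared with a much larger existing one, because the Jaccard index is then small even if the candidate is entirely covered.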

Published in F1000Research

ISSN: 2046-1402 (Online)
Publisher: F1000 Research Ltd
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://f1000research.com
