Clustering huge protein sequence sets in linear time

Martin Steinegger; Johannes Söding

doi:10.1038/s41467-018-04964-5

Nature Communications (Jun 2018)

Affiliations

Martin Steinegger: Quantitative and Computational Biology group, Max-Planck Institute for Biophysical Chemistry
Johannes Söding: Quantitative and Computational Biology group, Max-Planck Institute for Biophysical Chemistry

Journal volume & issue: Vol. 9, no. 1
pp. 1–8

Abstract

Billions of metagenomic and genomic sequences fill up public datasets, which makes similarity clustering an important and time-critical analysis step. Here, the authors develop Linclust, an algorithm with linear time complexity that can cluster over a billion sequences within hours on a single server.

Published in Nature Communications

ISSN: 2041-1723 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/ncomms/
