Nature Communications (Jun 2018)

Clustering huge protein sequence sets in linear time

  • Martin Steinegger
  • Johannes Söding

DOI
https://doi.org/10.1038/s41467-018-04964-5
Journal volume & issue
Vol. 9, no. 1
pp. 1 – 8

Abstract

Billions of metagenomic and genomic sequences fill up public datasets, which makes similarity clustering an important and time-critical analysis step. Here, the authors develop Linclust, an algorithm with linear time complexity that can cluster over a billion sequences within hours on a single server.