phylotaR: An Automated Pipeline for Retrieving Orthologous DNA Sequences from GenBank in R

Dominic J. Bennett; Hannes Hettling; Daniele Silvestro; Alexander Zizka; Christine D. Bacon; Søren Faurby; Rutger A. Vos; Alexandre Antonelli

doi:10.3390/life8020020

Life (Jun 2018)

Affiliations

Dominic J. Bennett: Gothenburg Global Biodiversity Centre, Box 461, SE-405 30 Gothenburg, Sweden
Hannes Hettling: Naturalis Biodiversity Center, P.O. Box 9517, 2300 RA Leiden, The Netherlands
Daniele Silvestro: Gothenburg Global Biodiversity Centre, Box 461, SE-405 30 Gothenburg, Sweden
Alexander Zizka: Gothenburg Global Biodiversity Centre, Box 461, SE-405 30 Gothenburg, Sweden
Christine D. Bacon: Gothenburg Global Biodiversity Centre, Box 461, SE-405 30 Gothenburg, Sweden
Søren Faurby: Gothenburg Global Biodiversity Centre, Box 461, SE-405 30 Gothenburg, Sweden
Rutger A. Vos: Naturalis Biodiversity Center, P.O. Box 9517, 2300 RA Leiden, The Netherlands
Alexandre Antonelli: Gothenburg Global Biodiversity Centre, Box 461, SE-405 30 Gothenburg, Sweden

Journal volume & issue: Vol. 8, no. 2, p. 20

Abstract

The exceptional increase in molecular DNA sequence data in open repositories is mirrored by an ever-growing interest among evolutionary biologists in harvesting and using those data for phylogenetic inference. Many quality issues are known, however, and the sheer amount and complexity of the available data can pose considerable barriers to their usefulness. A key issue in this domain is the high frequency of sequence mislabeling encountered when searching for suitable sequences for phylogenetic analysis. Mislabeling issues include, among others, the incorrect identification of sequenced species, non-standardized and ambiguous sequence annotation, and the inadvertent addition of paralogous sequences by users. Taken together, these issues likely add considerable noise, error or bias to phylogenetic inference, a risk that is likely to increase with the size of phylogenies or of the molecular datasets used to generate them. Here we present a software package, phylotaR, that bypasses the above issues by instead using an alignment search tool to identify orthologous sequences. Our package builds on the framework of its predecessor, PhyLoTa, by providing a modular pipeline for identifying overlapping sequence clusters using up-to-date GenBank data and by providing new features, improvements and tools. We demonstrate and test our pipeline’s effectiveness by presenting trees generated from phylotaR clusters for two large taxonomic clades: palms and primates. Given the versatility of this package, we hope that it will become a standard tool for any research aiming to use GenBank data for phylogenetic analysis.
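
The abstract describes grouping sequences into clusters based on the results of an alignment search rather than on their annotations. The toy sketch below illustrates that general idea in Python; it is not the package's own code (phylotaR is written in R) and makes no claim about its actual algorithm. It assumes that pairwise hits are already available as (query, subject, coverage) tuples, as an all-vs-all alignment search might produce. The function name `cluster_by_hits`, the sequence IDs and the 0.5 coverage threshold are all hypothetical.

```python
from collections import defaultdict


def cluster_by_hits(seq_ids, hits, min_coverage=0.5):
    """Group sequence IDs by single-linkage clustering over pairwise hits.

    hits: iterable of (query_id, subject_id, coverage) tuples, standing in
    for the output of an all-vs-all alignment search (an assumption).
    """
    # Union-find structure: each sequence starts as its own cluster.
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, coverage in hits:
        # Only hits passing the (hypothetical) coverage threshold link clusters.
        if coverage >= min_coverage:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q

    clusters = defaultdict(set)
    for s in seq_ids:
        clusters[find(s)].add(s)

    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Made-up example data: seqA-seqB-seqC are linked; seqD-seqE fall below threshold.
seq_ids = ["seqA", "seqB", "seqC", "seqD", "seqE"]
hits = [("seqA", "seqB", 0.9), ("seqB", "seqC", 0.8), ("seqD", "seqE", 0.3)]
clusters = cluster_by_hits(seq_ids, hits)
print(clusters)
```

In this sketch seqA and seqC end up in the same cluster even though no direct hit connects them, because single linkage joins any sequences connected through a chain of hits. Whether that behaviour is desirable depends on the threshold chosen.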

Published in Life

ISSN: 2075-1729 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science
Website: http://www.mdpi.com/journal/life
