Efficient COI barcoding using high throughput single-end 400 bp sequencing

Chentao Yang; Yuxuan Zheng; Shangjin Tan; Guanliang Meng; Wei Rao; Caiqing Yang; David G. Bourne; Paul A. O’Brien; Junqiang Xu; Sha Liao; Ao Chen; Xiaowei Chen; Xinrui Jia; Ai-bing Zhang; Shanlin Liu

doi:10.1186/s12864-020-07255-w

BMC Genomics (Dec 2020)

Affiliations

Chentao Yang: BGI-Shenzhen
Yuxuan Zheng: College of Life Sciences, Capital Normal University
Shangjin Tan: BGI-Shenzhen
Guanliang Meng: BGI-Shenzhen
Wei Rao: BGI-Shenzhen
Caiqing Yang: College of Life Sciences, Capital Normal University
David G. Bourne: College of Science and Engineering, James Cook University
Paul A. O’Brien: College of Science and Engineering, James Cook University
Junqiang Xu: BGI-Shenzhen
Sha Liao: BGI-Shenzhen
Ao Chen: BGI-Shenzhen
Xiaowei Chen: BGI-Shenzhen
Xinrui Jia: College of Life Sciences, Capital Normal University
Ai-bing Zhang: College of Life Sciences, Capital Normal University
Shanlin Liu: BGI-Shenzhen

DOI: https://doi.org/10.1186/s12864-020-07255-w
Journal volume & issue: Vol. 21, no. 1, pp. 1–10

Abstract

Background: Over the last decade, the rapid development of high-throughput sequencing platforms has accelerated species description and assisted morphological classification through DNA barcoding. However, current high-throughput DNA barcoding methods either cannot obtain full-length barcode sequences because of read length limitations (e.g. a maximum read length of 300 bp on Illumina’s MiSeq system) or are hindered by a relatively high cost or low sequencing output (e.g. a maximum of eight million reads per cell on PacBio’s SEQUEL II system).

Results: Pooled cytochrome c oxidase subunit I (COI) barcodes from individual specimens were sequenced on the MGISEQ-2000 platform using the single-end 400 bp (SE400) module. We present a bioinformatic pipeline, HIFI-SE, that takes reads generated from the 5′ and 3′ ends of the COI barcode region and assembles them into full-length barcodes. HIFI-SE is written in Python and includes four functional modules: filter, assign, assembly and taxonomy. We applied HIFI-SE to a set of 845 samples (30 marine invertebrates, 815 insects) and obtained a total of 747 fully assembled COI barcodes as well as 70 Wolbachia and fungal symbionts. Of the 72 samples with corresponding Sanger sequences available, nearly all (71/72) were correctly and accurately assembled, including 46 samples with a similarity score of 100% and 25 with a score of ca. 99%.

Conclusions: The HIFI-SE pipeline represents an efficient way to produce standard full-length barcodes, and the reasonable cost and high sensitivity of our method allow considerably more DNA barcodes to be produced under the same budget. Our method thereby advances DNA-based species identification from diverse ecosystems and increases the number of relevant applications.
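The assembly idea described in the Results, joining a read from the 5′ end with a read from the 3′ end into one full-length barcode, can be illustrated with a minimal Python sketch. This is not the HIFI-SE implementation: the function names (`reverse_complement`, `merge_by_overlap`), the `min_overlap` and `max_mismatch_rate` parameters and their defaults are all hypothetical, and the sketch assumes the 3′-end read is reported on the opposite strand. For orientation, the standard animal COI barcode region is roughly 658 bp, so two 400 bp reads would be expected to overlap by about 140 bp.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    complement = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(complement)[::-1]


def merge_by_overlap(five_prime_read, three_prime_read,
                     min_overlap=20, max_mismatch_rate=0.05):
    """Join two end reads into one sequence via their overlap.

    Hypothetical helper, not part of HIFI-SE. The 3'-end read is assumed
    to come from the opposite strand, so it is reverse-complemented first.
    Returns the merged sequence, or None if no acceptable overlap exists.
    """
    forward = five_prime_read.upper()
    tail = reverse_complement(three_prime_read)
    longest = min(len(forward), len(tail))
    # Try the longest candidate overlap first, then shrink.
    for size in range(longest, min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(forward[-size:], tail[:size]))
        if mismatches <= max_mismatch_rate * size:
            # Keep the forward read and append the non-overlapping tail.
            return forward + tail[size:]
    return None
```

Calling `merge_by_overlap(read_5p, read_3p)` on two reads that share a sufficiently long, near-identical overlap returns a single joined sequence; when no overlap passes the thresholds it returns `None`, which a caller could treat as a failed assembly for that sample. A real pipeline would also need to handle read clustering, sequencing errors and quality scores, which this sketch ignores.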

Published in BMC Genomics

ISSN: 1471-2164 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Technology: Chemical technology: Biotechnology; Science: Biology (General): Genetics
Website: http://bmcgenomics.biomedcentral.com
