Effects of error, chimera, bias, and GC content on the accuracy of amplicon sequencing

Yujia Qin; Liyou Wu; Qiuting Zhang; Chongqin Wen; Joy D. Van Nostrand; Daliang Ning; Lutgarde Raskin; Ameet Pinto; Jizhong Zhou

doi:10.1128/msystems.01025-23

mSystems (Dec 2023)

Affiliations

Yujia Qin: Department of Microbiology and Plant Biology, Institute for Environmental Genomics, University of Oklahoma, Norman, Oklahoma, USA
Liyou Wu: Department of Microbiology and Plant Biology, Institute for Environmental Genomics, University of Oklahoma, Norman, Oklahoma, USA
Qiuting Zhang: Department of Microbiology and Plant Biology, Institute for Environmental Genomics, University of Oklahoma, Norman, Oklahoma, USA
Chongqin Wen: Department of Microbiology and Plant Biology, Institute for Environmental Genomics, University of Oklahoma, Norman, Oklahoma, USA
Joy D. Van Nostrand: Department of Microbiology and Plant Biology, Institute for Environmental Genomics, University of Oklahoma, Norman, Oklahoma, USA
Daliang Ning: Department of Microbiology and Plant Biology, Institute for Environmental Genomics, University of Oklahoma, Norman, Oklahoma, USA
Lutgarde Raskin: Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, Michigan, USA
Ameet Pinto: School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
Jizhong Zhou: Department of Microbiology and Plant Biology, Institute for Environmental Genomics, University of Oklahoma, Norman, Oklahoma, USA

DOI: https://doi.org/10.1128/msystems.01025-23
Journal volume & issue: Vol. 8, no. 6

Abstract

ABSTRACT Targeted amplicon sequencing is widely used in microbial ecology studies. However, sequencing artifacts and amplification biases are of great concern. To identify sources of these artifacts, a systematic analysis was performed using mock communities composed of 16S rRNA genes from 33 bacterial strains. Our results indicated that while sequencing errors were generally isolated to low-abundance operational taxonomic units, chimeric sequences were a major source of artifacts. Singleton and doubleton sequences were primarily chimeras. Formation of chimeric sequences was significantly correlated with the GC content of the targeted sequences. Low-GC-content mock community members exhibited lower rates of chimeric sequence formation. GC content also had a large impact on sequence recovery. The quantitative capacity was notably limited, with substantial recovery variations and weak correlation between anticipated and observed strain abundances. The mock community strains with higher GC content had higher recovery rates than strains with lower GC content. Amplification bias was also observed due to differences in primer affinity. A two-step PCR strategy reduced the number of chimeric sequences by half. In addition, comparative analyses based on the mock communities showed that several widely used sequence processing pipelines/methods, including DADA2, Deblur, UCLUST, UNOISE, and UPARSE, had different advantages and disadvantages in artifact removal and rare species detection. These results are important for improving sequencing quality and reliability and developing new algorithms to process targeted amplicon sequences.

IMPORTANCE Amplicon sequencing of targeted genes is the predominant approach to estimate the membership and structure of microbial communities. However, accurate reconstruction of community composition is difficult due to sequencing errors and other methodological biases, and effective approaches to overcome these challenges are essential. Using a mock community of 33 phylogenetically diverse strains, this study evaluated the effect of GC content on sequencing results and tested different approaches to improve overall sequencing accuracy while characterizing the pros and cons of popular amplicon sequence data processing approaches. The sequencing results from this study can serve as a benchmarking data set for future algorithmic improvements. Furthermore, the new insights on sequencing error, chimera formation, and GC bias from this study will help enhance the quality of amplicon sequencing studies and support the development of new data analysis approaches.
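
To make the two quantities discussed in the abstract concrete, the following is a minimal Python sketch of how GC content and a per-strain recovery ratio could be computed. The function names, the handling of ambiguous bases, and the definition of recovery as observed relative abundance divided by expected relative abundance are assumptions made for illustration; they are not taken from the paper and may differ from the authors' calculations.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence.

    Ambiguous symbols such as N are ignored (an assumption of this sketch).
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return gc / total


def recovery_ratio(observed_reads: int, total_reads: int,
                   expected_fraction: float) -> float:
    """Observed relative abundance divided by expected relative abundance.

    A value of 1.0 means the strain was recovered at its expected proportion;
    values below 1.0 indicate under-recovery (hypothetical definition).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if expected_fraction <= 0:
        raise ValueError("expected_fraction must be positive")
    return (observed_reads / total_reads) / expected_fraction
```

With these helpers, one could compute `gc_content` for each mock community reference sequence and compare it against `recovery_ratio` for the same strain to look for the kind of GC-dependent recovery trend the abstract describes.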

Published in mSystems

ISSN: 2379-5077 (Online)
Publisher: American Society for Microbiology
Country of publisher: United States
LCC subjects: Science: Microbiology
Website: https://journals.asm.org/journal/msystems
