A Systematic Evaluation of High-Throughput Sequencing Approaches to Identify Low-Frequency Single Nucleotide Variants in Viral Populations

David J. King; Graham Freimanis; Lidia Lasecka-Dykes; Amin Asfor; Paolo Ribeca; Ryan Waters; Donald P. King; Emma Laing

doi:10.3390/v12101187

Viruses (Oct 2020)

Affiliations

David J. King: The Pirbright Institute, Woking, Surrey GU24 0NF, UK
Graham Freimanis: The Pirbright Institute, Woking, Surrey GU24 0NF, UK
Lidia Lasecka-Dykes: The Pirbright Institute, Woking, Surrey GU24 0NF, UK
Amin Asfor: The Pirbright Institute, Woking, Surrey GU24 0NF, UK
Paolo Ribeca: Biomathematics and Statistics Scotland, Edinburgh, Midlothian EH9 3FD, UK
Ryan Waters: The Pirbright Institute, Woking, Surrey GU24 0NF, UK
Donald P. King: The Pirbright Institute, Woking, Surrey GU24 0NF, UK
Emma Laing: Department of Microbial and Cellular Sciences, Faculty of Health and Medical Sciences, School of Biosciences and Medicine, University of Surrey, Guildford GU2 7XH, UK

DOI: https://doi.org/10.3390/v12101187
Journal volume & issue: Vol. 12, no. 10
p. 1187

Abstract

High-throughput sequencing technologies, such as those provided by Illumina, are an efficient way to understand sequence variation within viral populations. However, challenges exist in distinguishing process-introduced error from biological variance, which significantly impacts our ability to identify sub-consensus single-nucleotide variants (SNVs). Here we have taken a systematic approach to evaluate laboratory and bioinformatic pipelines to accurately identify low-frequency SNVs in viral populations. Artificial DNA and RNA “populations” were created by introducing known SNVs at predetermined frequencies into template nucleic acid before being sequenced on an Illumina MiSeq platform. These were used to assess the effects of abundance and starting input material type, technical replicates, read length and quality, short-read aligner, and percentage frequency thresholds on the ability to accurately call variants. Analyses revealed that the abundance and type of input nucleic acid had the greatest impact on the accuracy of SNV calling as measured by a micro-averaged Matthews correlation coefficient score, with DNA and high RNA inputs (10^7 copies) allowing for variants to be called at a 0.2% frequency. Reduced input RNA (10^5 copies) required more technical replicates to maintain accuracy, while low RNA inputs (10^3 copies) suffered from consensus-level errors. Base errors at specific motifs were also identified in all technical replicates; these can be excluded to further increase SNV calling accuracy. These findings indicate that samples with low RNA inputs should be excluded for SNV calling and reinforce the importance of optimising the technical and bioinformatics steps in pipelines that are used to accurately identify sequence variants.
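
The accuracy measure named in the abstract can be illustrated with a minimal sketch. It assumes "micro-averaged" has its usual meaning: true/false positive and negative counts are pooled across replicates before a single Matthews correlation coefficient is computed. The function names, the per-site frequency dictionaries, and the example numbers are all hypothetical and are not taken from the paper; only the 0.2% threshold comes from the abstract.

```python
from math import sqrt

def confusion_counts(observed_freqs, true_positions, threshold, n_sites):
    """Return (tp, fp, tn, fn) for one replicate.

    observed_freqs: dict mapping genome position -> variant frequency (0..1)
    true_positions: set of positions where an SNV was deliberately introduced
    threshold: minimum frequency for a variant to be called
    n_sites: total number of positions evaluated
    """
    called = {pos for pos, freq in observed_freqs.items() if freq >= threshold}
    tp = len(called & true_positions)
    fp = len(called - true_positions)
    fn = len(true_positions - called)
    tn = n_sites - tp - fp - fn
    return tp, fp, tn, fn

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

def micro_averaged_mcc(per_replicate_counts):
    """Pool the confusion counts over replicates, then compute one MCC."""
    tp = sum(c[0] for c in per_replicate_counts)
    fp = sum(c[1] for c in per_replicate_counts)
    tn = sum(c[2] for c in per_replicate_counts)
    fn = sum(c[3] for c in per_replicate_counts)
    return mcc(tp, fp, tn, fn)

# Invented example data: two replicates of a 100-site template with
# known SNVs at positions 10, 50 and 90, called at a 0.2% threshold.
truth = {10, 50, 90}
rep1 = {10: 0.005, 50: 0.002, 77: 0.0005}
rep2 = {10: 0.004, 77: 0.003}
counts = [confusion_counts(r, truth, 0.002, 100) for r in (rep1, rep2)]
score = micro_averaged_mcc(counts)
```

Pooling counts before scoring weights every evaluated site equally, so a replicate with many sites influences the result more than a small one; averaging per-replicate MCC values instead would be a macro-average and can give a different number.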

Published in Viruses

ISSN: 1999-4915 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Microbiology
Website: http://www.mdpi.com/journal/viruses
