CFSAN SNP Pipeline: an automated method for constructing SNP matrices from next-generation sequence data

Steve Davis; James B. Pettengill; Yan Luo; Justin Payne; Al Shpuntoff; Hugh Rand; Errol Strain

doi:10.7717/peerj-cs.20

PeerJ Computer Science (Aug 2015)

Affiliations

Steve Davis: Biostatistics and Bioinformatics Staff, Center for Food Safety and Applied Nutrition, Food and Drug Administration, College Park, MD, USA
James B. Pettengill: Biostatistics and Bioinformatics Staff, Center for Food Safety and Applied Nutrition, Food and Drug Administration, College Park, MD, USA
Yan Luo: Biostatistics and Bioinformatics Staff, Center for Food Safety and Applied Nutrition, Food and Drug Administration, College Park, MD, USA
Justin Payne: Division of Microbiology, Center for Food Safety and Applied Nutrition, Food and Drug Administration, College Park, MD, USA
Al Shpuntoff: Center for Food Safety and Applied Nutrition Scientific Engineering, Engility Corporation at FDA, Food and Drug Administration, College Park, MD, USA
Hugh Rand: Biostatistics and Bioinformatics Staff, Center for Food Safety and Applied Nutrition, Food and Drug Administration, College Park, MD, USA
Errol Strain: Biostatistics and Bioinformatics Staff, Center for Food Safety and Applied Nutrition, Food and Drug Administration, College Park, MD, USA

DOI: https://doi.org/10.7717/peerj-cs.20
Journal volume & issue: Vol. 1, p. e20

Abstract


The analysis of next-generation sequence (NGS) data is often a fragmented, step-wise process. For example, multiple pieces of software are typically needed to map NGS reads, extract variant sites, and construct a DNA sequence matrix containing only single nucleotide polymorphisms (i.e., a SNP matrix) for a set of individuals. Managing and chaining these software pieces and their outputs can be cumbersome and difficult. Here, we present CFSAN SNP Pipeline, which combines into a single package the mapping of NGS reads to a reference genome with Bowtie2, processing of those mapping (BAM) files using SAMtools, identification of variant sites using VarScan, and production of a SNP matrix using custom Python scripts. We also introduce a Python package (CFSAN SNP Mutator) that, given a reference genome, generates variants at known positions, against which we validate our pipeline. We created 1,000 simulated Salmonella enterica subsp. enterica serovar Agona genomes at 100× and 20× coverage, each containing 500 SNPs, 20 single-base insertions and 20 single-base deletions. For the 100× dataset, the CFSAN SNP Pipeline recovered 98.9% of the introduced SNPs and had a false positive rate of 1.04 × 10⁻⁶; for the 20× dataset, 98.8% of SNPs were recovered and the false positive rate was 8.34 × 10⁻⁷. Based on these results, CFSAN SNP Pipeline is a robust and accurate tool that is among the first to combine into a single executable the myriad steps required to produce a SNP matrix from NGS data. Such a tool is useful to those working in an applied setting (e.g., food safety traceback investigations) as well as to those interested in evolutionary questions.
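
The validation idea described in the abstract (introduce SNPs at known positions, build a SNP matrix, then check how many introduced SNPs are recovered) can be illustrated with a minimal Python sketch. This is not the CFSAN SNP Pipeline or CFSAN SNP Mutator code: the function names `mutate_reference`, `build_snp_matrix`, and `recovery_rate` are hypothetical, indels are ignored, and the sketch compares full sequences directly instead of mapping reads with Bowtie2 and calling variants with VarScan.

```python
import random

BASES = "ACGT"


def mutate_reference(reference, num_snps, seed=None):
    """Hypothetical mutator: substitute num_snps bases at random positions.

    Returns the mutated sequence and a dict {position: new_base} recording
    the introduced SNPs (0-based positions).
    """
    rng = random.Random(seed)
    positions = rng.sample(range(len(reference)), num_snps)
    mutated = list(reference)
    introduced = {}
    for pos in positions:
        # Choose a base that differs from the reference base at this position.
        new_base = rng.choice([b for b in BASES if b != reference[pos]])
        mutated[pos] = new_base
        introduced[pos] = new_base
    return "".join(mutated), introduced


def build_snp_matrix(reference, samples):
    """Keep only positions where at least one sample differs from the reference.

    samples maps a sample name to a sequence of the same length as the
    reference. Returns the sorted variable positions and, per sample, the
    string of bases observed at those positions.
    """
    positions = [
        i for i in range(len(reference))
        if any(seq[i] != reference[i] for seq in samples.values())
    ]
    matrix = {
        name: "".join(seq[i] for i in positions)
        for name, seq in samples.items()
    }
    return positions, matrix


def recovery_rate(introduced_positions, called_positions):
    """Fraction of introduced SNP positions that appear among the called ones."""
    introduced = set(introduced_positions)
    if not introduced:
        raise ValueError("no introduced SNPs to evaluate")
    return len(introduced & set(called_positions)) / len(introduced)


# Toy example: three simulated samples from a short, made-up reference.
reference = "ACGTACGTACGTACGTACGT"
samples = {}
truth = set()
for i in range(3):
    sequence, introduced = mutate_reference(reference, num_snps=2, seed=i)
    samples[f"sample{i}"] = sequence
    truth.update(introduced)

positions, matrix = build_snp_matrix(reference, samples)
rate = recovery_rate(truth, positions)
```

Because this toy version compares error-free sequences, every introduced SNP is recovered. With real reads, sequencing errors, coverage, and mapping ambiguity are what make recovery imperfect, which is what the simulated 100× and 20× datasets measure.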

Published in PeerJ Computer Science

ISSN: 2376-5992 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://peerj.com/computer-science/
