CaPSID: A bioinformatics platform for computational pathogen sequence identification in human genomes and transcriptomes

Borozan Ivan; Wilson Shane; Blanchette Paola; Laflamme Philippe; Watt Stuart N; Krzyzanowski Paul M; Sircoulomb Fabrice; Rottapel Robert; Branton Philip E; Ferretti Vincent

doi:10.1186/1471-2105-13-206

BMC Bioinformatics (Aug 2012)

DOI: https://doi.org/10.1186/1471-2105-13-206
Journal volume & issue: Vol. 13, no. 1, p. 206

Abstract

Background: It is now well established that nearly 20% of human cancers are caused by infectious agents, and the list of human oncogenic pathogens will grow in the future for a variety of cancer types. Whole tumor transcriptome and genome sequencing by next-generation sequencing technologies presents an unparalleled opportunity for pathogen detection and discovery in human tissues but requires development of new genome-wide bioinformatics tools.

Results: Here we present CaPSID (Computational Pathogen Sequence IDentification), a comprehensive bioinformatics platform for identifying, querying and visualizing both exogenous and endogenous pathogen nucleotide sequences in tumor genomes and transcriptomes. CaPSID includes a scalable, high-performance database for data storage and a web application that integrates the genome browser JBrowse. CaPSID also provides useful metrics for sequence analysis of pre-aligned BAM files, such as gene and genome coverage, and is optimized to run efficiently on multiprocessor computers with low memory usage.

Conclusions: To demonstrate the usefulness and efficiency of CaPSID, we carried out a comprehensive analysis of both a simulated dataset and transcriptome samples from ovarian cancer. CaPSID correctly identified all of the human and pathogen sequences in the simulated dataset, while in the ovarian dataset CaPSID’s predictions were successfully validated in vitro.
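The abstract names gene and genome coverage as metrics but does not define them. As a rough illustration only, the sketch below assumes coverage means the fraction of reference bases in a region that are overlapped by at least one aligned read. The function names and the interval representation are hypothetical and are not taken from CaPSID; reading alignments from a BAM file is left out so the example stays self-contained.

```python
from typing import Iterable, List, Tuple

# Assumption: an alignment is a 0-based, half-open interval [start, end)
# on a single reference sequence (e.g. one pathogen genome).
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into disjoint ones."""
    merged: List[Interval] = []
    # Drop empty intervals, then sweep in order of start position.
    for start, end in sorted((s, e) for s, e in intervals if e > s):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def coverage_fraction(alignments: Iterable[Interval],
                      region_start: int,
                      region_end: int) -> float:
    """Fraction of bases in [region_start, region_end) covered by >= 1 read."""
    length = region_end - region_start
    if length <= 0:
        raise ValueError("region must have positive length")
    covered = 0
    for start, end in merge_intervals(alignments):
        # Clip each merged block to the region of interest.
        lo = max(start, region_start)
        hi = min(end, region_end)
        if hi > lo:
            covered += hi - lo
    return covered / length


# Hypothetical reads aligned to a 1,000-base pathogen genome.
reads = [(0, 50), (40, 100), (150, 200)]
genome_coverage = coverage_fraction(reads, 0, 1000)   # whole genome
gene_coverage = coverage_fraction(reads, 120, 220)    # one gene's span
```

Under this definition the same helper serves both metrics: genome coverage passes the full reference span, while gene coverage passes the coordinates of a single annotated gene.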

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
