A bioinformatic pipeline for simulating viral integration data

Suzanne Scott; Susanna Grigson; Felix Hartkopf; Claus V. Hallwirth; Ian E. Alexander; Denis C. Bauer; Laurence O.W. Wilson

Data in Brief (Jun 2022)

Affiliations

Suzanne Scott: Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, North Ryde, Australia; Gene Therapy Research Unit, Children's Medical Research Institute, Westmead, Australia; The Sydney Children's Hospitals Network, Faculty of Medicine and Health, The University of Sydney, Westmead, Australia
Susanna Grigson: College of Science and Engineering, Flinders University, Adelaide, Australia
Felix Hartkopf: Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
Claus V. Hallwirth: Gene Therapy Research Unit, Children's Medical Research Institute, Westmead, Australia; The Sydney Children's Hospitals Network, Faculty of Medicine and Health, The University of Sydney, Westmead, Australia
Ian E. Alexander: Gene Therapy Research Unit, Children's Medical Research Institute, Westmead, Australia; The Sydney Children's Hospitals Network, Faculty of Medicine and Health, The University of Sydney, Westmead, Australia; Discipline of Child and Adolescent Health, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, Australia
Denis C. Bauer: Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, North Ryde, Australia; Macquarie University, Department of Biomedical Sciences, Faculty of Medicine and Health Science, Macquarie Park, Australia; Macquarie University, Applied BioSciences, Faculty of Science and Engineering, Macquarie Park, Australia; Corresponding author.
Laurence O.W. Wilson: Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, North Ryde, Australia; Macquarie University, Applied BioSciences, Faculty of Science and Engineering, Macquarie Park, Australia; Corresponding author.

Journal volume & issue: Vol. 42, p. 108161

Abstract

Viral integration is a complex biological process, and a reference integration dataset with known properties is useful both as a comparison for experimental data and as a benchmark for computational tools that detect integration. To generate such data, we developed a pipeline for simulating integrations of a viral or vector genome into a host genome. Our method reproduces the more complex characteristics of vector and viral integration, including integration of sub-genomic fragments, structural variation of the integrated genomes, and deletions from the host genome at the integration site. The method [1] takes the form of a Snakemake [2] pipeline, in which a Python [3] script using the Biopython [4] module simulates integrations of a viral reference into a host reference. This produces a reference containing integrations, from which sequencing reads are simulated using ART [5]. A second Python script then annotates the IDs of the reads crossing integration junctions to produce the final output: the simulated reads and a table of the integration locations and the reads crossing each integration junction. To illustrate our method, we provide simulated reads and integration locations, as well as the code required to simulate integrations using any virus and host reference. This simulation method was used to investigate the performance of viral integration tools in our research [6].
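
The two core ideas in the abstract, inserting a sub-genomic viral fragment into a host sequence (optionally deleting host bases at the site) and then identifying reads that cross the resulting junctions, can be illustrated with a minimal sketch. This is not the authors' pipeline: it uses only the Python standard library rather than Biopython, Snakemake, or ART, and the names `Integration`, `simulate_integration`, and `reads_crossing`, as well as the parameters `min_len` and `max_deletion`, are hypothetical choices made for illustration. Reads are represented simply as `(id, start, stop)` tuples with half-open coordinates on the simulated sequence.

```python
import random
from dataclasses import dataclass


@dataclass
class Integration:
    host_pos: int        # insertion position in the original host sequence
    left_junction: int   # host/virus boundary in the simulated sequence
    right_junction: int  # virus/host boundary in the simulated sequence
    virus_start: int     # start of the integrated fragment in the virus
    virus_stop: int      # stop (exclusive) of the fragment in the virus
    host_deleted: int    # host bases deleted at the integration site


def simulate_integration(host, virus, rng, min_len=5, max_deletion=3):
    """Insert a random sub-genomic fragment of `virus` into `host`.

    Assumes len(virus) >= min_len and that `host` is long enough to
    leave at least one base on either side of the integration.
    """
    # Choose a viral fragment of at least `min_len` bases.
    frag_len = rng.randint(min_len, len(virus))
    v_start = rng.randint(0, len(virus) - frag_len)
    fragment = virus[v_start:v_start + frag_len]

    # Choose a host position and delete a few host bases at the site.
    deleted = rng.randint(0, max_deletion)
    pos = rng.randint(1, len(host) - deleted - 1)

    simulated = host[:pos] + fragment + host[pos + deleted:]
    record = Integration(
        host_pos=pos,
        left_junction=pos,
        right_junction=pos + frag_len,
        virus_start=v_start,
        virus_stop=v_start + frag_len,
        host_deleted=deleted,
    )
    return simulated, record


def reads_crossing(junction, reads):
    """Return IDs of reads (id, start, stop) that span `junction`.

    A read spans the junction if it covers at least one base on each
    side of the boundary between positions junction - 1 and junction.
    """
    return [rid for rid, start, stop in reads if start < junction < stop]


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed for reproducibility
    host = "ACGT" * 10              # toy host reference
    virus = "TTGACCAGGT"            # toy viral reference
    simulated, rec = simulate_integration(host, virus, rng)

    # Toy "reads": 8-base windows tiled every 4 bases.
    reads = [(f"read{i}", i, i + 8) for i in range(0, len(simulated) - 7, 4)]
    print(rec)
    print("left junction reads:", reads_crossing(rec.left_junction, reads))
    print("right junction reads:", reads_crossing(rec.right_junction, reads))
```

In the published pipeline the reads come from a read simulator run on the integrated reference, so the tiled windows above merely stand in for read alignments with known coordinates. The sketch also omits structural variation of the integrated genome, which the abstract lists among the simulated characteristics.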

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/
