Moving Just Enough Deep Sequencing Data to Get the Job Done

Nicholas Mills; Ethan M Bensman; William L Poehlman; Walter B Ligon; F Alex Feltus

doi:10.1177/1177932219856359

Bioinformatics and Biology Insights (Jun 2019)

Affiliations

Nicholas Mills: Holcombe Department of Electrical and Computer Engineering, Clemson University, Clemson, SC, USA
Ethan M Bensman: School of Computing, Clemson University, Clemson, SC, USA
William L Poehlman: Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA
Walter B Ligon: Holcombe Department of Electrical and Computer Engineering, Clemson University, Clemson, SC, USA
F Alex Feltus: Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA

DOI: https://doi.org/10.1177/1177932219856359
Journal volume & issue: Vol. 13

Abstract

Motivation: As high-throughput DNA sequence datasets continue to grow, the cost of transferring and storing them may prevent their processing in all but the largest data centers or commercial cloud providers. To lower this cost, it should be possible to process only a subset of the original data while still preserving the biological information of interest.

Results: Using 4 high-throughput DNA sequence datasets of differing sequencing depth from 2 species as use cases, we demonstrate the effect of processing partial datasets on the number of RNA transcripts detected by an RNA-Seq workflow. We used transcript detection to decide on a cutoff point. We then physically transferred the minimal partial dataset and compared this with the transfer of the full dataset; the partial transfer reduced the total transfer time by approximately 25%. These results suggest that as sequencing datasets get larger, one way to speed up analysis is to simply transfer the minimal amount of data that still sufficiently detects the biological signal.

Availability: All results were generated using public datasets from NCBI and publicly available open source software.
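
The idea of choosing a cutoff point from transcript detection can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `minimal_fraction`, the 95% threshold, and the transcript counts below are all hypothetical, chosen only to show how one might pick the smallest partial dataset whose detected-transcript count comes close to that of the full dataset.

```python
def minimal_fraction(fractions, detected, threshold=0.95):
    """Return the smallest data fraction whose detected-transcript count
    reaches `threshold` times the count obtained from the largest fraction.

    `threshold` is an assumed tuning parameter, not a value from the paper.
    """
    if not fractions or len(fractions) != len(detected):
        raise ValueError("fractions and detected must be non-empty and equal length")

    # Sort by fraction so the last entry corresponds to the full dataset.
    pairs = sorted(zip(fractions, detected))
    full_count = pairs[-1][1]

    for fraction, count in pairs:
        if count >= threshold * full_count:
            return fraction
    return pairs[-1][0]


# Hypothetical, made-up counts for illustration only (not results from the paper).
fractions = [0.25, 0.50, 0.75, 1.00]
detected = [7000, 9600, 9900, 10000]

cutoff = minimal_fraction(fractions, detected)
print(f"Smallest fraction meeting the threshold: {cutoff}")
```

In practice the counts would come from running the RNA-Seq workflow on progressively larger subsets of the reads, and the threshold would depend on how much loss of signal the analysis can tolerate.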

Published in Bioinformatics and Biology Insights

ISSN: 1177-9322 (Online)
Publisher: SAGE Publishing
Country of publisher: United Kingdom
LCC subjects: Science: Biology (General)
Website: https://journals.sagepub.com/home/bbi
