Discarding duplicate ditags in LongSAGE analysis may introduce significant error

Hahn Stephan A; Høgh Annabeth; Heidenblut Anna M; Emmersen Jeppe; Welinder Karen G; Nielsen Kåre L

doi:10.1186/1471-2105-8-92

BMC Bioinformatics (Mar 2007)

Journal volume & issue: Vol. 8, no. 1, p. 92

Abstract

Background: During gene expression analysis by Serial Analysis of Gene Expression (SAGE), duplicate ditags are routinely removed from the data analysis because they are suspected to stem from artifacts during SAGE library construction. As a consequence, naturally occurring duplicate ditags are also removed, which introduces a measurement error.

Results: An algorithm was developed to analyze the differential occurrence of SAGE tags in different ditag combinations. Analysis of a pancreatic acinar cell LongSAGE library showed no sign of a general amplification bias that would justify the removal of all duplicate ditags. Extending the analysis to 10 additional LongSAGE libraries likewise showed no justification for removing all duplicate ditags. On the contrary, while the error introduced in original SAGE by removing naturally occurring duplicate ditags is insignificant, in LongSAGE it leads to an error of up to 3-fold. However, the algorithm developed for the analysis of duplicate ditags was able to identify individual artifact ditags that originated from rare nucleotide variations of tags and from vector contamination.

Conclusion: The removal of all duplicate ditags was unfounded for the datasets analyzed and led to large errors. This may also be the case for other LongSAGE datasets already present in databases. Analysis of the ditag population, however, can identify artifact tags that should be removed from the analysis or have their tag count adjusted.
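The effect described in the abstract can be illustrated with a small Python sketch. It is not the authors' algorithm: the function names, the toy tag sequences, and the decision to treat a ditag as an unordered pair of tags are all assumptions made for illustration. The sketch counts each tag once with every ditag kept and once with duplicate ditags discarded, then reports how strongly discarding undercounts each tag.

```python
from collections import Counter


def tag_counts(ditags, discard_duplicates):
    """Count individual tags in a list of (tag_a, tag_b) ditags.

    Assumption of this sketch: a ditag is an unordered pair, so
    (a, b) and (b, a) are treated as the same ditag.
    """
    canonical = [tuple(sorted(pair)) for pair in ditags]
    if discard_duplicates:
        # Keep only one copy of each distinct ditag.
        canonical = sorted(set(canonical))
    counts = Counter()
    for tag_a, tag_b in canonical:
        counts[tag_a] += 1
        counts[tag_b] += 1
    return counts


def undercount_fold(ditags):
    """Ratio of tag counts with all ditags kept to counts after discarding duplicates."""
    kept_all = tag_counts(ditags, discard_duplicates=False)
    deduplicated = tag_counts(ditags, discard_duplicates=True)
    return {tag: kept_all[tag] / deduplicated[tag] for tag in kept_all}


# Hypothetical toy library: two abundant tags pair with each other repeatedly.
toy_ditags = (
    [("AAAA", "CCCC")] * 3
    + [("CCCC", "AAAA")]
    + [("AAAA", "GGGG"), ("TTTT", "GGGG")]
)

for tag, fold in sorted(undercount_fold(toy_ditags).items()):
    print(tag, fold)
```

In this toy library the two abundant tags lose most of their counts when duplicates are discarded, while the rare tags are unaffected. That is the direction of error the abstract describes, although the magnitudes here are invented and say nothing about real libraries.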

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
