Contamination Survey of Insect Genomic and Transcriptomic Data
Jiali Zhou,
Xinrui Zhang,
Yujie Wang,
Haoxian Liang,
Yuhao Yang,
Xiaolei Huang,
Jun Deng
Affiliations
All authors: State Key Laboratory of Ecological Pest Control for Fujian and Taiwan Crops, College of Plant Protection, Fujian Agriculture and Forestry University, Fuzhou 350002, China
The rapid advancement of high-throughput sequencing has greatly increased the volume of sequence data in public databases and, with it, the amount of contamination, that is, sequences from non-target species present in the data of a target species. Although Insecta is the most diverse group within Arthropoda, the prevalence of contamination in public insect data has not been comprehensively evaluated, nor have its potential causes been analyzed. In this study, COI barcodes were used to investigate contamination from insects and mammals in GenBank genomic (WGS) and transcriptomic (TSA) assemblies across four insect orders. Of the 2796 WGS and 1382 TSA assemblies analyzed, contamination was detected in 32 (1.14%) WGS and 152 (11.0%) TSA assemblies. The key findings are as follows: (1) TSA data were contaminated more frequently than WGS data; (2) contamination rates varied significantly among the four orders (Hemiptera 9.22%, Coleoptera 3.48%, Hymenoptera 7.66%, and Diptera 1.89%); (3) possible causes of contamination include food, parasitism, sample collection, and cross-contamination. Overall, this study proposes a workflow for detecting contamination in WGS and TSA data and offers suggestions for mitigating it.
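The contamination rates quoted above follow directly from the assembly counts. The short Python sketch below reproduces that arithmetic as an illustration only; the `AssemblySet` class and the `summarize` helper are hypothetical names introduced here and are not part of the study's workflow. The only inputs are the four counts stated in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblySet:
    """Counts for one assembly type (hypothetical helper, not from the study)."""
    name: str
    total: int          # assemblies analyzed
    contaminated: int   # assemblies in which contamination was detected

    @property
    def rate_percent(self) -> float:
        # Share of analyzed assemblies flagged as contaminated, in percent.
        return 100.0 * self.contaminated / self.total


# Counts taken from the abstract.
wgs = AssemblySet("WGS", total=2796, contaminated=32)
tsa = AssemblySet("TSA", total=1382, contaminated=152)


def summarize(sets):
    """Return {name: rate in percent, rounded to two decimals}."""
    return {s.name: round(s.rate_percent, 2) for s in sets}


rates = summarize([wgs, tsa])

# How many times more often TSA assemblies were flagged than WGS assemblies.
ratio = tsa.rate_percent / wgs.rate_percent

# Pooled rate across both assembly types (not reported in the abstract).
overall = 100.0 * (wgs.contaminated + tsa.contaminated) / (wgs.total + tsa.total)

print(rates)
print(f"TSA/WGS rate ratio: {ratio:.2f}")
print(f"Pooled rate: {overall:.2f}%")
```

The rounded rates match the 1.14% and 11.0% reported above. The ratio and the pooled rate are derived here for illustration and are not figures stated by the authors.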