A high-performance approach for predicting donor splice sites based on short window size and imbalanced large samples

Ying Zeng; Hongjie Yuan; Zheming Yuan; Yuan Chen

doi:10.1186/s13062-019-0236-y

Biology Direct (Apr 2019)

Affiliations

Ying Zeng: Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University
Hongjie Yuan: Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University
Zheming Yuan: Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University
Yuan Chen: Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Hunan Agricultural University

DOI: https://doi.org/10.1186/s13062-019-0236-y
Journal volume & issue: Vol. 14, no. 1
pp. 1 – 15

Abstract

Background: Splice site prediction is a long-standing problem in bioinformatics. Although many computational approaches to splice site prediction achieve satisfactory accuracy, further improvement matters because it allows gene structure to be predicted more accurately. A proper window size must be determined before prediction. An overly long window may introduce irrelevant features that reduce predictive accuracy, whereas a short window that retains the maximum information may perform better in both predictive accuracy and time cost. Furthermore, because false splice sites following the GT–AG rule far outnumber true splice sites, accurate and rapid prediction of splice sites from large imbalanced samples has always been a challenge. We therefore developed a new computational method for donor splice site prediction, named chi-square decision table (χ2-DT), which is based on a short window size and large imbalanced samples.

Results: Using a short window size of 11 bp, χ2-DT extracts improved positional and compositional features based on the chi-square test, introduces features one at a time according to information gain, and constructs a balanced decision table for imbalanced pattern classification. With a training set of 2000 true sites and 271,132 false sites, χ2-DT achieves the highest independent test accuracy (93.34%) when compared with three classifiers (random forest, artificial neural network, and relaxed variable kernel density estimator) and requires a short computation time (89 s). χ2-DT also shows good independent test accuracy (92.40%) when validated on BG-570 sequences mutated with frameshift errors (nucleotide insertions and deletions). Moreover, when compared with both long-window and short-window methods, χ2-DT performs better than all of them in predictive accuracy.

Conclusions: Based on a short window size and large imbalanced samples, the proposed method not only achieves higher predictive accuracy than some existing methods, but also offers high computational speed and good robustness against nucleotide insertions and deletions.

Reviewers: This article was reviewed by Ryan McGinty, Ph.D. and Dirk Walther.
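
The abstract names two statistical ingredients, the chi-square test and information gain, without giving their exact formulation. The Python sketch below illustrates only the generic textbook versions: a per-position chi-square statistic over a class-by-nucleotide contingency table, and the information gain of a discrete feature. It is not the authors' χ2-DT implementation. The function names (`chi_square_by_position`, `entropy`, `information_gain`) and the list-of-strings input format are assumptions made for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"


def chi_square_by_position(true_sites, false_sites):
    """Return one chi-square statistic per window position.

    Hypothetical helper: each statistic comes from a 2 x 4 contingency
    table (class x nucleotide) built at that position. Characters other
    than A, C, G, T are ignored.
    """
    width = len(true_sites[0])
    stats = []
    for pos in range(width):
        rows = [Counter(seq[pos] for seq in group)
                for group in (true_sites, false_sites)]
        row_totals = [sum(row[b] for b in BASES) for row in rows]
        col_totals = [rows[0][b] + rows[1][b] for b in BASES]
        total = sum(row_totals)
        chi2 = 0.0
        for i, row in enumerate(rows):
            for j, base in enumerate(BASES):
                expected = row_totals[i] * col_totals[j] / total
                if expected > 0:  # skip nucleotides absent from both classes
                    chi2 += (row[base] - expected) ** 2 / expected
        stats.append(chi2)
    return stats


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

In a pipeline like the one the abstract describes, positions or features with larger statistics would be the more discriminative candidates, and features could then be ranked by information gain before being added one at a time. How χ2-DT actually combines these steps is specified in the full paper, not here.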

Published in Biology Direct

ISSN: 1745-6150 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Science: Biology (General)
Website: https://biologydirect.biomedcentral.com/
