IEEE Access (Jan 2020)
lncLocPred: Predicting LncRNA Subcellular Localization Using Multiple Sequence Feature Information
Abstract
Determining the subcellular localization of long non-coding RNAs (lncRNAs) provides valuable clues to their function. As an alternative to time-consuming and expensive biochemical experiments, we develop lncLocPred, a machine learning predictor based on logistic regression, to predict the subcellular localization of lncRNAs. We adopt sequence features including k-mer, triplet, and PseDNC, and perform feature selection in stages using VarianceThreshold, binomial distribution, and F-score to obtain representative features. We observe that the top-ranked k-mers have a higher G and C content, in the form of short repeats. Our model improves prediction accuracy for several subcellular localizations and achieves an overall accuracy of 92.37% on the benchmark dataset under jackknife validation, higher than that of existing state-of-the-art predictors. lncLocPred also performs better on an independent dataset that we collected. Related experimental data and source code are available at https://github.com/jademyC1221/lncLocPred.
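As a rough illustration of the k-mer features and the variance-based filtering step mentioned in the abstract, the sketch below computes normalized k-mer frequencies and drops constant features. It is not the lncLocPred implementation: the function names kmer_frequencies and variance_filter, the toy sequences, and the choice k = 2 are assumptions made for this example, and the variance filter is a plain-Python stand-in for the idea behind VarianceThreshold.

```python
from itertools import product


def kmer_frequencies(seq, k):
    """Return normalized k-mer frequencies over the ACGT alphabet (hypothetical helper)."""
    seq = seq.upper().replace("U", "T")  # treat RNA input as DNA letters
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def variance_filter(rows, threshold=0.0):
    """Return indices of feature columns whose population variance exceeds threshold."""
    n = len(rows)
    keep = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        if var > threshold:
            keep.append(j)
    return keep


# Toy sequences, invented for illustration only.
toy_sequences = ["GCGCGCAT", "ATATATGC", "GGCCGGCC"]
features = [kmer_frequencies(s, 2) for s in toy_sequences]
kept = variance_filter(features)
```

Here each sequence becomes a 16-dimensional dinucleotide frequency vector, and kept holds the indices of the dinucleotides whose frequency varies across the toy sequences. In a full pipeline, the surviving features would then be ranked further (for example by F-score) before training a classifier.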
Keywords