BMC Bioinformatics (Jul 2021)

Feature selection for RNA cleavage efficiency at specific sites using the LASSO regression model in Arabidopsis thaliana

  • Daishin Ueno,
  • Harunori Kawabe,
  • Shotaro Yamasaki,
  • Taku Demura,
  • Ko Kato

DOI
https://doi.org/10.1186/s12859-021-04291-5
Journal volume & issue
Vol. 22, no. 1
pp. 1 – 17

Abstract

Background: RNA degradation is important for the regulation of gene expression. Despite the identification of proteins and sequences related to deadenylation-dependent RNA degradation in plants, endonucleolytic cleavage-dependent RNA degradation has not been studied in detail. Here, we developed truncated RNA end sequencing in Arabidopsis thaliana to identify cleavage sites and evaluate the efficiency of cleavage at each site. Although several features are related to RNA cleavage efficiency, the effect of each feature on cleavage efficiency has not been evaluated by considering multiple putative determinants in A. thaliana.

Results: Cleavage site information was acquired from a previous study, and cleavage efficiency at the site level (CSsite value), which indicates the number of reads at each cleavage site normalized to RNA abundance, was calculated. To identify features related to cleavage efficiency at the site level, multiple putative determinants (features) were used to perform feature selection using the Least Absolute Shrinkage and Selection Operator (LASSO) regression model. The results indicated that whole RNA features were important for the CSsite value, in addition to features around cleavage sites. Whole RNA features related to the translation process and nucleotide frequency around cleavage sites were major determinants of cleavage efficiency. The results were verified in a model constructed using only sequence features, which showed that the prediction accuracy was similar to that determined using all features including the translation process, suggesting that cleavage efficiency can be predicted using only sequence information. The LASSO regression model was validated in exogenous genes, which showed that the model constructed using only sequence information can predict cleavage efficiency in both endogenous and exogenous genes.

Conclusions: Feature selection using the LASSO regression model in A. thaliana identified 155 features. Correlation coefficients revealed that whole RNA features are important for determining cleavage efficiency in addition to features around the cleavage sites. The LASSO regression model can predict cleavage efficiency in endogenous and exogenous genes using only sequence information. The model revealed the significance of the effect of multiple determinants on cleavage efficiency, suggesting that sequence features are important for RNA degradation mechanisms in A. thaliana.
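
The abstract does not give implementation details, so the following is only a minimal sketch of how LASSO performs feature selection, not the authors' pipeline. It fits an L1-penalized least-squares model by coordinate descent on synthetic data in which only two of six columns influence the response. The function names (`soft_threshold`, `standardize`, `lasso_fit`, `make_toy_data`), the toy data, and the penalty value `lam=0.2` are all assumptions made for illustration. Features whose coefficients are shrunk exactly to zero are treated as not selected.

```python
import random


def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def standardize(X):
    """Return a copy of X with each column scaled to mean 0, variance 1."""
    n, p = len(X), len(X[0])
    cols = []
    for j in range(p):
        col = [row[j] for row in X]
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        cols.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return [[cols[j][i] for j in range(p)] for i in range(n)]


def lasso_fit(X, y, lam, n_sweeps=100):
    """Minimize (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw, and w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def make_toy_data(n=200, p=6, seed=0):
    """Synthetic data (illustrative only): just columns 0 and 3 affect y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0, 0.1) for row in X]
    return X, y


X, y = make_toy_data()
Xs = standardize(X)
y_mean = sum(y) / len(y)
yc = [v - y_mean for v in y]  # center the response so no intercept is needed

weights = lasso_fit(Xs, yc, lam=0.2)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("selected feature indices:", selected)
```

On this toy data the two informative columns are the ones expected to keep nonzero coefficients. Raising `lam` removes more features, while lowering it lets weakly correlated features back into the model, so in practice the penalty is usually tuned, for example by cross-validation.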

Keywords