Identification of cyclin protein using gradient boost decision tree algorithm

Hasan Zulfiqar; Shi-Shi Yuan; Qin-Lai Huang; Zi-Jie Sun; Fu-Ying Dao; Xiao-Long Yu; Hao Lin

Computational and Structural Biotechnology Journal (Jan 2021)

Affiliations

Hasan Zulfiqar: School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
Shi-Shi Yuan: School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
Qin-Lai Huang: School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
Zi-Jie Sun: School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
Fu-Ying Dao: School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
Xiao-Long Yu: School of Materials Science and Engineering, Hainan University, Haikou 570228, China; Corresponding authors.
Hao Lin: School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Corresponding authors.

Journal volume & issue: Vol. 19, pp. 4123–4131

Abstract

Cyclin proteins regulate the cell cycle by forming complexes with cyclin-dependent kinases, thereby activating cell-cycle progression. Correct recognition of cyclin proteins could provide key clues for studying their functions. However, their sequences share low similarity, so methods based on sequence similarity predict them poorly. Thus, a machine learning model for identifying cyclin proteins is urgently needed. This study aimed to develop a computational model to discriminate cyclin proteins from non-cyclin proteins. In our model, protein sequences were encoded by seven kinds of features: amino acid composition, composition of k-spaced amino acid pairs, tripeptide composition, pseudo amino acid composition, Geary autocorrelation, normalized Moreau-Broto autocorrelation, and composition/transition/distribution. These features were then optimized using analysis of variance (ANOVA) and minimum redundancy maximum relevance (mRMR) combined with the incremental feature selection (IFS) technique. A gradient boost decision tree (GBDT) classifier was trained on the optimal features. Five-fold cross-validation showed that our model could identify cyclins with an accuracy of 93.06% and an AUC value of 0.971, both higher than the values reported by two recent studies on the same data.
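
To make the first two encodings named in the abstract concrete, the following is a minimal sketch of amino acid composition and the composition of k-spaced amino acid pairs. The function names, the handling of non-standard residues, and the choice of gap k are illustrative assumptions; the abstract does not state the parameters the authors used.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(seq):
    """Fraction of each standard residue in seq (20 values).

    Assumption: non-standard symbols (e.g. X) are ignored.
    """
    seq = seq.upper()
    n = sum(1 for r in seq if r in AMINO_ACIDS)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / n for a in AMINO_ACIDS]

def k_spaced_pair_composition(seq, k):
    """Frequency of each ordered residue pair with k residues between them (400 values).

    k = 0 gives the ordinary dipeptide composition. The gap value is a
    hypothetical parameter here, not one taken from the paper.
    """
    seq = seq.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - k - 1):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [counts[p] / total for p in counts]
```

In a pipeline like the one described, vectors such as these would be concatenated with the other feature types before feature selection and classifier training; those later steps are not shown here.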

Published in Computational and Structural Biotechnology Journal

ISSN: 2001-0370 (Online)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Technology: Chemical technology: Biotechnology
Website: https://www.journals.elsevier.com/computational-and-structural-biotechnology-journal
