IEEE Access (Jan 2018)
Identification of Mammalian Enzymatic Proteins Based on Sequence-Derived Features and Species-Specific Scheme
Abstract
Enzymatic proteins (EPs) are widely distributed in organisms and cells and are implicated in biochemical processes. Without these proteins, most biochemical reactions would proceed only slowly at the mild temperatures and pressures found in living bodies. Given the wide application of EPs in drug discovery and disease therapy, they should be identified accurately, but no specific method has yet been reported for identifying EPs from primary sequences. In this paper, we propose a novel method for predicting mammalian EPs. We collect a series of sequence-based features observed in EPs and perform detailed analyses to investigate the intrinsic properties of EPs and non-EPs. To remove redundant features and select an optimal feature subset, we introduce the Fisher-Markov selector and incremental feature selection. Based on the optimal feature subset, our method achieves area under the curve (AUC) values of 0.731, 0.820, and 0.822 on three training datasets using fivefold cross-validation. Our strategy also shows good generalization capability on independent testing datasets. We further compare our species-specific models with universal models, and the comparison confirms the effectiveness of the species-specific scheme. We believe that our method is useful for biomedical research on EPs. The proposed method is implemented in a user-friendly Web server named predict EPs, which is freely available for academic use at http://www.inforstation.com/webservers/PEP/.
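The feature-selection step summarized in the abstract (rank the features, then grow the subset one feature at a time and keep the size with the best fivefold cross-validated AUC) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: a plain per-feature Fisher score stands in for the Fisher-Markov selector, a nearest-centroid scorer is a placeholder classifier, AUC is pooled over the five folds, and all function names and the toy data are hypothetical.

```python
import random
from statistics import mean, pvariance

def fisher_score(column, labels):
    """Per-feature Fisher score (assumed stand-in for the Fisher-Markov selector)."""
    pos = [v for v, y in zip(column, labels) if y == 1]
    neg = [v for v, y in zip(column, labels) if y == 0]
    spread = pvariance(pos) + pvariance(neg)
    return (mean(pos) - mean(neg)) ** 2 / (spread + 1e-12)

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def centroid_scores(train_X, train_y, test_X, feats):
    """Placeholder classifier: higher score means closer to the positive centroid."""
    def centroid(cls):
        rows = [x for x, y in zip(train_X, train_y) if y == cls]
        return [mean(r[f] for r in rows) for f in feats]
    c_pos, c_neg = centroid(1), centroid(0)
    def dist(x, c):
        return sum((x[f] - cj) ** 2 for f, cj in zip(feats, c))
    return [dist(x, c_neg) - dist(x, c_pos) for x in test_X]

def cv_auc(X, y, feats, k=5, seed=0):
    """k-fold cross-validation; held-out scores are pooled before computing AUC."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    scores, labels = [], []
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        train = [i for i in idx if i not in held]
        scores.extend(centroid_scores([X[i] for i in train], [y[i] for i in train],
                                      [X[i] for i in fold], feats))
        labels.extend(y[i] for i in fold)
    return auc(scores, labels)

def incremental_feature_selection(X, y, k=5):
    """Rank features once, then evaluate the top-1, top-2, ... subsets and keep the best."""
    n_feats = len(X[0])
    ranked = sorted(range(n_feats),
                    key=lambda f: fisher_score([row[f] for row in X], y),
                    reverse=True)
    best_auc, best_subset = -1.0, []
    for m in range(1, n_feats + 1):
        subset = ranked[:m]
        current = cv_auc(X, y, subset, k)
        if current > best_auc:
            best_auc, best_subset = current, subset
    return best_subset, best_auc

# Toy data (hypothetical): feature 0 separates the classes, features 1 and 2 are noise.
rng = random.Random(1)
y = [i % 2 for i in range(60)]
X = [[rng.gauss(2.0 * label, 1.0), rng.gauss(0, 1), rng.gauss(0, 1)] for label in y]
subset, score = incremental_feature_selection(X, y)
print("selected feature indices:", subset, "cross-validated AUC:", round(score, 3))
```

In this sketch the features are ranked once on the full data for brevity; a stricter protocol would repeat the ranking inside each training fold so that held-out samples do not influence the selection.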
Keywords