Variable selection from a feature representing protein sequences: a case of classification on bacterial type IV secreted effectors

Jian Zhang; Lixin Lv; Donglei Lu; Denan Kong; Mohammed Abdoh Ali Al-Alashaari; Xudong Zhao

doi:10.1186/s12859-020-03826-6

BMC Bioinformatics (Oct 2020)

Affiliations

Jian Zhang: College of Artificial Intelligence, Wuxi Vocational College of Science and Technology
Lixin Lv: College of Artificial Intelligence, Wuxi Vocational College of Science and Technology
Donglei Lu: College of Artificial Intelligence, Wuxi Vocational College of Science and Technology
Denan Kong: College of Information and Computer Engineering, Northeast Forestry University
Mohammed Abdoh Ali Al-Alashaari: College of Information and Computer Engineering, Northeast Forestry University
Xudong Zhao: College of Information and Computer Engineering, Northeast Forestry University

DOI: https://doi.org/10.1186/s12859-020-03826-6
Journal volume & issue: Vol. 21, no. 1
pp. 1 – 15

Abstract

Background: Classifying proteins with specific functions is important for biological research, and the encoding approach used to extract features from protein sequences plays an important role in that classification. Many computational methods (namely classifiers) are applied to protein sequences under various encoding approaches. Protein sequences commonly carry labels corresponding to different categories of biological function (e.g., bacterial type IV secreted effector or not), which makes protein prediction unrealistic: prediction presupposes a kernel set of protein sequences whose labels have been certified by biological experiments, and such a set has hardly ever been seen in prevailing research. For prediction, unsupervised learning rather than supervised learning (e.g., classification) should therefore be considered. For protein classification, various classifiers can help evaluate the effectiveness of different encoding approaches. Variable selection from an encoded feature representing protein sequences is a further important issue that needs to be considered.

Results: Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking as a case a benchmark dataset of 1947 protein sequences, composed of 399 T4SE and 1548 non-T4SE, we run experiments to identify bacterial type IV secreted effectors (T4SE). Comparable and quantified results are obtained using only certain components of the encoded feature, i.e., the position-specific scoring matrix, which indicates the effectiveness of our method.

Conclusions: Certain variables, rather than the entire encoded feature they belong to, are effective for discriminating between different types of proteins. In addition, ensemble classifiers with an automatic assignment of different base classifiers achieve a better classification result.

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
