IEEE Access (Jan 2019)
A Randomized Clustering Forest Approach for Efficient Prediction of Protein Functions
Abstract
With advances in genetic sequencing technology, the automated assignment of protein function has become a key challenge in bioinformatics and computational biology. In nature, many proteins consist of several structural domains, and each domain typically either performs its own function independently or implements a new function in cooperation with neighboring domains. Thus, the multi-domain protein function prediction problem can be cast as a multi-instance multi-label (MIML) learning task. In this paper, we propose a novel ensemble MIML algorithm called multi-instance multi-label randomized clustering forest (MIMLRC-Forest) for protein function prediction. In MIMLRC-Forest, we build a set of hierarchical clustering trees and apply a label transfer mechanism to identify the relevant function labels during the learning process. A clustering tree with a hierarchical structure can handle the multi-label problem by exploiting the more discriminable label concepts at higher-level nodes and transferring the less discriminable labels to lower-level nodes. The label dependency can then be computed by aggregating the tree labels for protein function prediction. Extensive experiments on five real-world protein data sets show the effectiveness of the proposed algorithm compared with several state-of-the-art baselines, including MIMLSVM, MIMLNN, MIML-kNN, EnMIMLNN, and M3MIML.
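To make the MIML formulation concrete, the following minimal sketch shows how a multi-domain protein might be represented as a bag of instances (one feature vector per domain) with a set of function labels, and how label sets predicted by several trees could be combined. This is not the MIMLRC-Forest algorithm: the `Protein` class, the `aggregate_tree_labels` helper, the placeholder labels `F1` to `F3`, and the simple voting rule with `min_fraction` are all assumptions introduced for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Protein:
    """One MIML example (hypothetical structure, not from the paper)."""
    domains: List[List[float]]  # instances: one feature vector per structural domain
    functions: Set[str]         # labels: functions annotated on the whole protein


def aggregate_tree_labels(tree_predictions: List[Set[str]],
                          min_fraction: float = 0.5) -> Set[str]:
    """Keep every label predicted by at least `min_fraction` of the trees.

    Plain voting is an assumed stand-in for the paper's aggregation step.
    """
    votes = Counter(label for labels in tree_predictions for label in labels)
    threshold = min_fraction * len(tree_predictions)
    return {label for label, count in votes.items() if count >= threshold}


# A protein with three domains (two toy features each) and two known functions.
protein = Protein(
    domains=[[0.1, 0.9], [0.7, 0.2], [0.4, 0.4]],
    functions={"F1", "F2"},
)

# Label sets that three hypothetical trees might output for this protein.
tree_predictions = [{"F1", "F2"}, {"F1"}, {"F1", "F3"}]
predicted = aggregate_tree_labels(tree_predictions)
print(predicted)
```

With these toy inputs, only `F1` is predicted by at least half of the trees, so it is the only label kept. The key point of the representation is that labels attach to the whole bag rather than to individual domains, which is what distinguishes MIML from ordinary multi-label learning.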
Keywords