Multi-level machine learning prediction of protein–protein interactions in Saccharomyces cerevisiae

Julian Zubek; Marcin Tatjewski; Adam Boniecki; Maciej Mnich; Subhadip Basu; Dariusz Plewczynski

doi:10.7717/peerj.1041

PeerJ (Jul 2015)

Affiliations

Julian Zubek: Centre of New Technologies, University of Warsaw, Warsaw, Poland
Marcin Tatjewski: Centre of New Technologies, University of Warsaw, Warsaw, Poland
Adam Boniecki: Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland
Maciej Mnich: Faculty of Mathematics and Computer Science, Jagiellonian University, Cracow, Poland
Subhadip Basu: Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
Dariusz Plewczynski: Centre of New Technologies, University of Warsaw, Warsaw, Poland

Journal volume & issue: Vol. 3, p. e1041

Abstract

Accurate identification of protein–protein interactions (PPI) is a key step in understanding proteins’ biological functions, which are typically context-dependent. Many existing PPI predictors rely on aggregated features from protein sequences; however, only a few methods exploit local information about specific residue contacts. In this work we present a two-stage machine learning approach for the prediction of protein–protein interactions. We start with carefully filtered data on protein complexes available for Saccharomyces cerevisiae in the Protein Data Bank (PDB). First, we build linear descriptions of interacting and non-interacting sequence segment pairs based on their inter-residue distances. Second, we train machine learning classifiers to predict binary segment interactions for any two short sequence fragments. The final prediction of the protein–protein interaction is made using a 2D matrix representation of all-against-all possible interacting sequence segments of both analysed proteins. The level-I predictor achieves 0.88 AUC for micro-scale, i.e., residue-level, prediction. The level-II predictor improves the results further through a more complex learning paradigm. We perform a 30-fold macro-scale, i.e., protein-level, cross-validation experiment. The level-II predictor using PSIPRED-predicted secondary structure reaches 0.70 precision, 0.68 recall, and 0.70 AUC, whereas other popular methods score below 0.6 on recall, precision, and AUC. Our results demonstrate that a multi-scale sequence feature aggregation procedure is able to improve the machine learning results by more than 10% compared to other sequence representations. Prepared datasets and source code for our experimental pipeline are freely available for download from: http://zubekj.github.io/mlppi/ (open source Python implementation, OS independent).
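
The all-against-all segment matrix mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the window length, the function names, and in particular `toy_segment_score` (a hypothetical stand-in for a trained level-I classifier) are assumptions made only for illustration.

```python
def segments(sequence, window):
    """Return all contiguous fragments of length `window`, sliding by one residue."""
    return [sequence[i:i + window] for i in range(len(sequence) - window + 1)]


def toy_segment_score(seg_a, seg_b):
    """Hypothetical stand-in for a trained level-I classifier (an assumption).

    Returns the fraction of aligned positions at which both residues are
    hydrophobic. A real predictor would be learned from structural data.
    """
    hydrophobic = set("AVILMFWC")
    hits = sum(a in hydrophobic and b in hydrophobic for a, b in zip(seg_a, seg_b))
    return hits / len(seg_a)


def interaction_matrix(seq_a, seq_b, window=3, score=toy_segment_score):
    """Score every segment of seq_a against every segment of seq_b.

    Rows index segments of seq_a, columns index segments of seq_b.
    The window length of 3 is an illustrative assumption.
    """
    segs_a = segments(seq_a, window)
    segs_b = segments(seq_b, window)
    return [[score(sa, sb) for sb in segs_b] for sa in segs_a]


if __name__ == "__main__":
    for row in interaction_matrix("AVILK", "MFWG"):
        print(row)
```

In a two-level scheme of the kind the abstract describes, a matrix like this would be passed to a second-stage (level-II) classifier that decides whether the two proteins interact; that stage is not shown here.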

Published in PeerJ

ISSN: 2167-8359 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Medicine; Science: Biology (General)
Website: https://peerj.com/
