Semi-supervised Learning Predicts Approximately One Third of the Alternative Splicing Isoforms as Functional Proteins

Yanqi Hao; Recep Colak; Joan Teyra; Carles Corbi-Verge; Alexander Ignatchenko; Hannes Hahne; Mathias Wilhelm; Bernhard Kuster; Pascal Braun; Daisuke Kaida; Thomas Kislinger; Philip M. Kim

doi:10.1016/j.celrep.2015.06.031

Cell Reports (Jul 2015)

Affiliations

Yanqi Hao: Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 1AS, Canada
Recep Colak: Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 1AS, Canada
Joan Teyra: Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 1AS, Canada
Carles Corbi-Verge: Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 1AS, Canada
Alexander Ignatchenko: Department of Medical Biophysics, University of Toronto, Toronto, ON M5G 1L7, Canada
Hannes Hahne: Chair for Proteomics and Bioanalytics, TU Muenchen, Freising 85354, Germany
Mathias Wilhelm: Chair for Proteomics and Bioanalytics, TU Muenchen, Freising 85354, Germany
Bernhard Kuster: Chair for Proteomics and Bioanalytics, TU Muenchen, Freising 85354, Germany
Pascal Braun: Lehrstuhl fuer Systembiologie der Pflanzen, TU Muenchen, Munich, Germany
Daisuke Kaida: Frontier Research Core for Life Sciences, University of Toyama, Toyama 930-8555, Japan
Thomas Kislinger: Department of Medical Biophysics, University of Toronto, Toronto, ON M5G 1L7, Canada
Philip M. Kim: Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 1AS, Canada

DOI: https://doi.org/10.1016/j.celrep.2015.06.031
Journal volume & issue: Vol. 12, no. 2
Pages: 183–189

Abstract

Alternative splicing acts on transcripts from almost all human multi-exon genes. Notwithstanding its ubiquity, fundamental ramifications of splicing on protein expression remain unresolved. The number and identity of spliced transcripts that form stably folded proteins remain the sources of considerable debate, due largely to low coverage of experimental methods and the resulting absence of negative data. We circumvent this issue by developing a semi-supervised learning algorithm, positive unlabeled learning for splicing elucidation (PULSE; http://www.kimlab.org/software/pulse), which uses 48 features spanning various categories. We validated its accuracy on sets of bona fide protein isoforms and directly on mass spectrometry (MS) spectra for an overall AU-ROC of 0.85. We predict that around 32% of “exon skipping” alternative splicing events produce stable proteins, suggesting that the process engenders a significant number of previously uncharacterized proteins. We also provide insights into the distribution of positive isoforms in various functional classes and into the structural effects of alternative splicing.
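
For readers unfamiliar with the AU-ROC figure quoted in the abstract, the following is a minimal sketch of how an area under the ROC curve can be computed from classifier scores. It is not taken from PULSE and does not reproduce the authors' evaluation. The function name `au_roc` and the toy scores and labels are hypothetical, chosen only for illustration. The sketch uses the pairwise (Mann-Whitney) formulation: the AU-ROC equals the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as half.

```python
def au_roc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney statistic).

    scores: classifier outputs, where higher means "more likely positive"
    labels: 1 for a positive example, 0 for a negative example
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data for illustration only; not values from the paper.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = au_roc(toy_scores, toy_labels)
```

Under this definition a value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported 0.85 can be read as the positive example outranking the negative one in roughly 85% of such pairs. The quadratic loop is fine for small sets; a rank-based implementation would be preferable for large ones.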

Published in Cell Reports

ISSN: 2211-1247 (Online)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Science: Biology (General)
Website: http://www.cell.com/cell-reports/home
