Mathematical and Computational Applications (May 2021)

Semi-Supervised Learning Using Hierarchical Mixture Models: Gene Essentiality Case Study

  • Michael W. Daniels,
  • Daniel Dvorkin,
  • Rani K. Powers,
  • Katerina Kechris

DOI: https://doi.org/10.3390/mca26020040
Journal volume & issue: Vol. 26, No. 2, p. 40

Abstract


Integrating gene-level data is useful for predicting the role of genes in biological processes. This problem has typically been approached with supervised classification, which requires large training sets of positive and negative examples. However, training data sets that are too small for supervised approaches can still provide valuable information. We describe a hierarchical mixture model that uses limited positively labeled gene training data for semi-supervised learning. We focus on the problem of predicting essential genes, where a gene is required for the survival of an organism under particular conditions. Using cross-validation, we found that including positively labeled samples in a semi-supervised learning framework with the hierarchical mixture model improves the detection of essential genes compared to unsupervised, supervised, and other semi-supervised approaches. Prediction performance also improved when genes were incorrectly assumed to be non-essential. Our comparisons indicate that incorporating even small amounts of existing knowledge improves the accuracy of prediction and decreases variability in predictions. Although we focused on gene essentiality, the hierarchical mixture model and semi-supervised framework apply generally to problems focused on the prediction of genes or other features, with multiple data types characterizing the feature and a small set of positive labels.
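To make the semi-supervised idea concrete, the sketch below shows one way to fit a two-component mixture when only a small set of positive (essential) labels is available: labeled genes are pinned to the essential component in the E-step, while unlabeled genes receive soft assignments. This is a deliberately simplified, single-data-type Gaussian version, not the authors' hierarchical multi-data-type model; the function name, variable names (scores, pos_labeled), and the choice of Gaussian components are illustrative assumptions.

```python
# Minimal sketch of semi-supervised EM for a two-component Gaussian mixture.
# Component 1 = "essential", component 0 = "non-essential".
# Genes with a positive label are fixed to component 1; unlabeled genes get soft assignments.

import numpy as np
from scipy.stats import norm

def semi_supervised_em(scores, pos_labeled, n_iter=200, tol=1e-6):
    """scores: 1-D array of gene-level evidence; pos_labeled: boolean mask of known essential genes."""
    x = np.asarray(scores, dtype=float)
    lab = np.asarray(pos_labeled, dtype=bool)

    # Initialize the essential component from the labeled genes' statistics.
    pi = max(lab.mean(), 0.05)                     # mixing proportion of the essential component
    mu1, sd1 = x[lab].mean(), x[lab].std() + 1e-3  # essential component parameters
    mu0, sd0 = x[~lab].mean(), x[~lab].std() + 1e-3

    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability that each gene is essential.
        p1 = pi * norm.pdf(x, mu1, sd1)
        p0 = (1 - pi) * norm.pdf(x, mu0, sd0)
        resp = p1 / (p1 + p0 + 1e-300)
        resp[lab] = 1.0                            # labeled genes: membership is known

        # M-step: weighted parameter updates.
        w1, w0 = resp, 1.0 - resp
        pi = w1.mean()
        mu1 = np.average(x, weights=w1)
        sd1 = np.sqrt(np.average((x - mu1) ** 2, weights=w1)) + 1e-6
        mu0 = np.average(x, weights=w0)
        sd0 = np.sqrt(np.average((x - mu0) ** 2, weights=w0)) + 1e-6

        # Convergence check on the observed-data log-likelihood
        # (labeled genes contribute only through the essential component).
        ll = np.sum(np.log(np.where(lab, p1, p1 + p0) + 1e-300))
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

    return resp, (pi, mu1, sd1, mu0, sd0)
```

The returned posterior probabilities can be thresholded or ranked to prioritize candidate essential genes; the paper's hierarchical formulation extends this idea to multiple gene-level data types jointly rather than a single score.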

Keywords