Development of Supervised Learning Predictive Models for Highly Non-linear Biological, Biomedical, and General Datasets

David Medina-Ortiz; Sebastián Contreras; Cristofer Quiroz; Álvaro Olivera-Nappa

doi:10.3389/fmolb.2020.00013

Frontiers in Molecular Biosciences (Feb 2020)

Affiliations

David Medina-Ortiz: Departamento de Ingeniería Química, Biotecnología y Materiales, Facultad de Ciencias Físicas y Matemáticas, Universidad de Chile, Santiago, Chile
David Medina-Ortiz: Centre for Biotechnology and Bioengineering, Universidad de Chile, Santiago, Chile
Sebastián Contreras: Centre for Biotechnology and Bioengineering, Universidad de Chile, Santiago, Chile
Cristofer Quiroz: Facultad de Ingeniería, Universidad Autónoma de Chile, Talca, Chile
Álvaro Olivera-Nappa: Departamento de Ingeniería Química, Biotecnología y Materiales, Facultad de Ciencias Físicas y Matemáticas, Universidad de Chile, Santiago, Chile
Álvaro Olivera-Nappa: Centre for Biotechnology and Bioengineering, Universidad de Chile, Santiago, Chile

Journal volume & issue: Vol. 7

Abstract

In highly non-linear datasets, the attributes or features do not readily reveal visual patterns that identify common underlying behaviors. Classification or regression therefore cannot be achieved using linear or mildly non-linear hyperspace partition functions. As a result, supervised learning models built with most existing algorithms are limited, and their performance metrics are low. Linear transformations of variables, such as principal component analysis, cannot avoid the problem, and even models based on artificial neural networks and deep learning are unable to improve the metrics. Sometimes, even when the features allow classification or regression in reported cases, the performance metrics of supervised learning algorithms remain unsatisfactorily low. This problem recurs in many areas of study, for example in the clinical, biotechnological, and protein engineering areas, where many attributes are correlated in an unknown and highly non-linear fashion, or are categorical and difficult to relate to a target response variable. In such areas, the ability to create predictive models would dramatically improve the quality of outcomes, generating immediate added value for both the scientific community and the general public. In this manuscript, we present RV-Clustering, a library of unsupervised learning algorithms, together with a new methodology designed to find optimal partitions within highly non-linear datasets; these partitions allow deconvoluting variables and notably improve performance metrics in supervised learning classification or regression models. The partitions obtained are statistically cross-validated, ensuring correct representativity and no over-fitting. We have successfully tested RV-Clustering on several highly non-linear datasets of different origins.
The approach proposed herein has generated classification and regression models with high performance metrics, which further supports its ability to generate predictive models for highly non-linear datasets. Advantageously, the method does not require significant human input, which makes it more usable by members of the biological, biomedical, and protein engineering communities who have no specific knowledge of machine learning.
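
The general idea the abstract describes, partitioning a dataset with an unsupervised step and then fitting a separate supervised model inside each partition, can be illustrated with a toy sketch. This is not the RV-Clustering library or its API: the function names (`kmeans_1d`, `fit_partitioned`, and so on) are hypothetical, a plain 1-D k-means and ordinary least squares stand in for the algorithms the authors use, the synthetic target y = |x| is chosen only for illustration, and the statistical cross-validation of partitions mentioned in the abstract is omitted.

```python
import random
from statistics import mean


def nearest(centroids, x):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))


def kmeans_1d(xs, k, iters=50, seed=0):
    """Tiny 1-D k-means (Lloyd's algorithm); a stand-in for any clustering step."""
    rng = random.Random(seed)
    centroids = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest(centroids, x)].append(x)
        # Keep the previous centroid if a group happens to be empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return a, my - a * mx


def fit_partitioned(xs, ys, k):
    """Cluster on the feature, then fit one linear model per non-empty partition."""
    centroids = kmeans_1d(xs, k)
    kept_centroids, models = [], []
    for i, c in enumerate(centroids):
        members = [(x, y) for x, y in zip(xs, ys) if nearest(centroids, x) == i]
        if members:
            px, py = zip(*members)
            kept_centroids.append(c)
            models.append(fit_line(px, py))
    return kept_centroids, models


def predict_partitioned(centroids, models, x):
    """Route x to its partition and apply that partition's model."""
    a, b = models[nearest(centroids, x)]
    return a * x + b


def mse(preds, ys):
    return mean((p - y) ** 2 for p, y in zip(preds, ys))


# Synthetic non-linear target (illustrative only): y = |x| on [-5, 5].
xs = [i / 10 for i in range(-50, 51)]
ys = [abs(x) for x in xs]

# Baseline: a single linear model fitted to the whole dataset.
a, b = fit_line(xs, ys)
global_mse = mse([a * x + b for x in xs], ys)

# Partitioned: two clusters, one linear model in each.
centroids, models = fit_partitioned(xs, ys, k=2)
partitioned_mse = mse([predict_partitioned(centroids, models, x) for x in xs], ys)

print(f"single model MSE: {global_mse:.4f}")
print(f"partitioned MSE:  {partitioned_mse:.4f}")
```

In this toy case the two clusters split the data near x = 0, so each partition is exactly linear and the per-partition models fit far better than the single global line. Real datasets will not separate this cleanly, which is why the abstract stresses validating the partitions statistically.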

Published in Frontiers in Molecular Biosciences

ISSN: 2296-889X (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Science: Biology (General)
Website: https://www.frontiersin.org/journals/molecular-biosciences
