Development of Supervised Learning Predictive Models for Highly Non-linear Biological, Biomedical, and General Datasets

Medina Ortiz, David; Contreras, Sebastián; Quiroz, Cristofer; Olivera Nappa, Álvaro María

dc.creator	Medina Ortiz, David
dc.creator	Contreras, Sebastián
dc.creator	Quiroz, Cristofer
dc.creator	Olivera Nappa, Álvaro María
dc.date.accessioned	2020-05-06T00:08:30Z
dc.date.available	2020-05-06T00:08:30Z
dc.date.created	2020-05-06T00:08:30Z
dc.date.issued	2020
dc.identifier	Frontiers in Molecular Biosciences February 2020 | Volume 7 | Article 13
dc.identifier	10.3389/fmolb.2020.00013
dc.identifier	https://repositorio.uchile.cl/handle/2250/174427
dc.description.abstract	In highly non-linear datasets, attributes or features do not readily reveal visual patterns that identify common underlying behaviors. Therefore, it is not possible to achieve classification or regression using linear or mildly non-linear hyperspace partition functions. Hence, supervised learning models based on most existing algorithms are limited, and their performance metrics are low. Linear transformations of variables, such as principal component analysis, cannot avoid the problem, and even models based on artificial neural networks and deep learning are unable to improve the metrics. Sometimes, even when features allow classification or regression in reported cases, performance metrics of supervised learning algorithms remain unsatisfyingly low. This problem recurs in many areas of study, for example the clinical, biotechnological, and protein engineering areas, where many of the attributes are correlated in an unknown and very non-linear fashion or are categorical and difficult to relate to a target response variable. In such areas, being able to create predictive models would dramatically impact the quality of their outcomes, generating an immediate added value for both the scientific community and the general public. In this manuscript, we present RV-Clustering, a library of unsupervised learning algorithms, and a new methodology designed to find optimum partitions within highly non-linear datasets that allow deconvoluting variables and notably improving performance metrics in supervised learning classification or regression models. The partitions obtained are statistically cross-validated, ensuring correct representativity and no over-fitting. We have successfully tested RV-Clustering in several highly non-linear datasets with different origins.
The approach proposed herein has generated classification and regression models with high performance metrics, which further supports its ability to generate predictive models for highly non-linear datasets. Advantageously, the method does not require significant human input, which makes it more usable for members of the biological, biomedical, and protein engineering communities who have no specific knowledge of machine learning.
dc.language	en
dc.publisher	Frontiers Media
dc.rights	http://creativecommons.org/licenses/by-nc-nd/3.0/cl/
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Chile
dc.source	Frontiers in Molecular Biosciences
dc.subject	Highly non-linear datasets
dc.subject	Supervised learning algorithms
dc.subject	Clustering
dc.subject	Statistical techniques
dc.subject	Recursive binary methods
dc.title	Development of Supervised Learning Predictive Models for Highly Non-linear Biological, Biomedical, and General Datasets
dc.type	Journal article

This item belongs to the following institution

Universidad de Chile
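
The abstract describes partitioning a highly non-linear dataset so that simple supervised models perform well inside each partition. The following is a minimal toy sketch of that general idea only, not the RV-Clustering library or the authors' methodology: the synthetic data, the helper names (`fit_line`, `r_squared`, `two_means_1d`), and the choice of a two-cluster 1-D k-means are all assumptions made for illustration. Scores are computed on the training data, and the statistical cross-validation of partitions mentioned in the abstract is not reproduced here.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on 1-D inputs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def r_squared(xs, ys, model):
    """Coefficient of determination of a fitted line on (xs, ys)."""
    a, b = model
    my = statistics.fmean(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def two_means_1d(xs, iters=20):
    """Tiny 1-D k-means with k=2 (hypothetical stand-in for a real
    clustering step); returns a 0/1 label per point."""
    c0, c1 = min(xs), max(xs)
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        c0 = statistics.fmean([x for x, lab in zip(xs, labels) if lab == 0])
        c1 = statistics.fmean([x for x, lab in zip(xs, labels) if lab == 1])
    return labels


# Synthetic, V-shaped data: a single straight line cannot describe it.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [abs(x) + rng.gauss(0.0, 0.05) for x in xs]

# One global linear model over the whole dataset.
global_r2 = r_squared(xs, ys, fit_line(xs, ys))

# Partition first (unsupervised), then fit one linear model per partition.
labels = two_means_1d(xs)
part_r2 = []
for k in (0, 1):
    px = [x for x, lab in zip(xs, labels) if lab == k]
    py = [y for y, lab in zip(ys, labels) if lab == k]
    part_r2.append(r_squared(px, py, fit_line(px, py)))

print("global R^2:", round(global_r2, 3))
print("per-partition R^2:", [round(s, 3) for s in part_r2])
```

In this toy case the global fit explains almost none of the variance, while each partition is close to linear. A realistic workflow would also need held-out evaluation and a rule for assigning new samples to a partition.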