dc.creator: Medina Ortiz, David
dc.creator: Contreras, Sebastián
dc.creator: Quiroz, Cristofer
dc.creator: Olivera Nappa, Álvaro María
dc.date.accessioned: 2020-05-06T00:08:30Z
dc.date.available: 2020-05-06T00:08:30Z
dc.date.created: 2020-05-06T00:08:30Z
dc.date.issued: 2020
dc.identifier: Frontiers in Molecular Biosciences February 2020 | Volume 7 | Article 13
dc.identifier: 10.3389/fmolb.2020.00013
dc.identifier: https://repositorio.uchile.cl/handle/2250/174427
dc.description.abstract: In highly non-linear datasets, attributes or features do not readily reveal visual patterns that identify common underlying behaviors. Classification or regression therefore cannot be achieved with linear or mildly non-linear hyperspace partition functions. As a result, supervised learning models built with most existing algorithms are limited, and their performance metrics are low. Linear transformations of variables, such as principal component analysis, cannot avoid the problem, and even models based on artificial neural networks and deep learning are unable to improve the metrics. Sometimes, even when features allow classification or regression in reported cases, the performance metrics of supervised learning algorithms remain unsatisfyingly low. This problem recurs in many areas of study, for example the clinical, biotechnological, and protein engineering areas, where many of the attributes are correlated in an unknown and very non-linear fashion, or are categorical and difficult to relate to a target response variable. In such areas, being able to create predictive models would dramatically improve the quality of outcomes, generating immediate added value for both the scientific community and the general public. In this manuscript, we present RV-Clustering, a library of unsupervised learning algorithms, and a new methodology designed to find optimum partitions within highly non-linear datasets that allow deconvoluting variables and notably improving performance metrics in supervised learning classification or regression models. The partitions obtained are statistically cross-validated, ensuring correct representativeness and no over-fitting. We have successfully tested RV-Clustering on several highly non-linear datasets of different origins. The proposed approach has generated classification and regression models with high performance metrics, which further supports its ability to generate predictive models for highly non-linear datasets. Advantageously, the method does not require significant human input, which makes it more usable for members of the biological, biomedical, and protein engineering communities who have no specific knowledge of machine learning.
dc.language: en
dc.publisher: Frontiers Media
dc.rights: http://creativecommons.org/licenses/by-nc-nd/3.0/cl/
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 Chile
dc.source: Frontiers in Molecular Biosciences
dc.subject: Highly non-linear datasets
dc.subject: Supervised learning algorithms
dc.subject: Clustering
dc.subject: Statistical techniques
dc.subject: Recursive binary methods
dc.title: Development of Supervised Learning Predictive Models for Highly Non-linear Biological, Biomedical, and General Datasets
dc.type: Journal article

