Data preparation pipeline recommendation via meta-learning

Zagatti, Fernando Rezende

dc.contributor	Silva, Diego Furtado
dc.contributor	http://lattes.cnpq.br/7662777934692986
dc.contributor	http://lattes.cnpq.br/8060946497875227
dc.creator	Zagatti, Fernando Rezende
dc.date.accessioned	2021-08-23T14:16:25Z
dc.date.accessioned	2022-10-10T21:36:56Z
dc.date.available	2021-08-23T14:16:25Z
dc.date.available	2022-10-10T21:36:56Z
dc.date.created	2021-08-23T14:16:25Z
dc.date.issued	2021-05-26
dc.identifier	ZAGATTI, Fernando Rezende. Data preparation pipeline recommendation via meta-learning. 2021. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de São Carlos, São Carlos, 2021. Disponível em: https://repositorio.ufscar.br/handle/ufscar/14790.
dc.identifier	https://repositorio.ufscar.br/handle/ufscar/14790
dc.identifier.uri	http://repositorioslatinoamericanos.uchile.cl/handle/2250/4044973
dc.description.abstract	Data preparation is an essential stage in the machine learning pipeline, aiming to convert noisy and disordered data into refined data compatible with the algorithms. However, data preparation is time-consuming and requires specialized knowledge. In this scenario, automating data preparation and reducing the effort data scientists spend at this stage is a scientific challenge of great practical relevance. Each dataset has its own characteristics and can be interpreted in different ways. Despite its relevance, current automated machine learning (AutoML) platforms either disregard data preparation or rely on simple hardcoded pipelines. To fill this gap, we present a meta-learning-based recommendation system for data preparation. Our system recommends five pipelines, ranked by relevance, so it is useful for users with varied experience levels. Using the top recommendation to simulate an entirely automatic choice of data preparation pipeline, we demonstrate that our proposal improves the performance of an AutoML system that would otherwise be unable to find a classification model because of the noisy data. In addition, our method's accuracy rates are similar to those achieved by a reinforcement-learning-based algorithm with the same goal, but it is up to two orders of magnitude faster. Moreover, we demonstrate our method in a real-world application and evaluate its benefits and limitations in this scenario.
dc.language	eng
dc.publisher	Universidade Federal de São Carlos
dc.publisher	UFSCar
dc.publisher	Programa de Pós-Graduação em Ciência da Computação - PPGCC
dc.publisher	Câmpus São Carlos
dc.rights	http://creativecommons.org/licenses/by-nc-nd/3.0/br/
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Brazil
dc.subject	Automatização
dc.subject	Preparação de dados
dc.subject	Meta-aprendizado
dc.subject	Pré-processamento
dc.subject	Aprendizado de máquina
dc.subject	Automated
dc.subject	Data preparation
dc.subject	Meta-learning
dc.subject	Preprocessing
dc.subject	Machine learning
dc.title	Data preparation pipeline recommendation via meta-learning
dc.type	Thesis

This item belongs to the following institution

Universidade Federal de São Carlos (Brazil)