Comparative Study of the Performance of the Classification Algorithms of the Apache Spark ML Library

Camele, Genaro; Hasperué, Waldo; Ronchetti, Franco; Quiroga, Facundo Manuel

dc.creator	Camele, Genaro
dc.creator	Hasperué, Waldo
dc.creator	Ronchetti, Franco
dc.creator	Quiroga, Facundo Manuel
dc.date	2021-10
dc.date	2021
dc.date	2022-02-02T17:59:55Z
dc.date.accessioned	2023-07-15T05:22:22Z
dc.date.available	2023-07-15T05:22:22Z
dc.identifier	http://sedici.unlp.edu.ar/handle/10915/130348
dc.identifier	isbn:978-987-633-574-4
dc.identifier.uri	https://repositorioslatinoamericanos.uchile.cl/handle/2250/7473114
dc.description	Classification algorithms are widely used in many areas, including finance, education, security, and medicine. They are also used to support feature extraction techniques, which rely on a classifier to determine the best subset of attributes that still yields an acceptable prediction. Because ever larger amounts of data are being collected, databases keep growing and distributed processing becomes a necessity. In this context, Spark, and in particular its Spark ML library, is one of the most widely used frameworks for performing classification tasks on large databases. Since some feature extraction techniques must run a classification algorithm a significant number of times, with a different subset of attributes in each run, the performance of these algorithms should be known beforehand so that the overall feature extraction process completes in the shortest possible time. In this work, we carry out a comparative study of four Spark ML classification algorithms, measuring predictive power and execution time as a function of the number of attributes in the training dataset.
dc.description	Workshop: WBDMD - Databases and Data Mining
dc.description	Red de Universidades con Carreras en Informática
dc.format	application/pdf
dc.format	311-320
dc.language	en
dc.rights	http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights	Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.subject	Computer Science
dc.subject	Big Data
dc.subject	Machine learning
dc.subject	Classification Models
dc.subject	Apache Spark
dc.subject	Spark ML
dc.title	Comparative Study of the Performance of the Classification Algorithms of the Apache Spark ML Library
dc.type	Conference object

This item belongs to the following institution

Universidad Nacional de La Plata (Argentina)