FDR2-BD: A fast data reduction recommendation tool for tabular big data classification problems

Basgall, María José; Naiouf, Ricardo Marcelo; Fernández, Alberto

dc.creator	Basgall, María José
dc.creator	Naiouf, Ricardo Marcelo
dc.creator	Fernández, Alberto
dc.date.accessioned	2022-01-20T10:20:51Z
dc.date.accessioned	2022-10-15T07:04:47Z
dc.date.available	2022-01-20T10:20:51Z
dc.date.available	2022-10-15T07:04:47Z
dc.date.created	2022-01-20T10:20:51Z
dc.date.issued	2021-08
dc.identifier	Basgall, María José; Naiouf, Ricardo Marcelo; Fernández, Alberto; FDR2-BD: A fast data reduction recommendation tool for tabular big data classification problems; Molecular Diversity Preservation International; Electronics; 10; 15; 8-2021; 1-19
dc.identifier	2079-9292
dc.identifier	http://hdl.handle.net/11336/150370
dc.identifier	CONICET Digital
dc.identifier	CONICET
dc.identifier.uri	https://repositorioslatinoamericanos.uchile.cl/handle/2250/4358415
dc.description.abstract	In this paper, we present FDR2-BD, a methodological data condensation approach for reducing tabular big datasets in classification problems. The key to our proposal is to analyze data in a dual way (vertical and horizontal), so as to provide a smart combination of feature selection, which generates dense clusters of data, and uniform sampling reduction, which keeps only a few representative samples from each problem area. Its main advantage is that it keeps the model’s predictive quality within a range determined by a user’s threshold. Its robustness is built on a hyper-parametrization process in which all data are taken into consideration by following a k-fold procedure. Another significant capability is that it is fast and scalable, as it uses fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the obtained reduction percentages are above 95%, thus outperforming state-of-the-art solutions such as FCNN_MR, which barely reach 70%. The most promising outcome is that the representativeness of the original data is maintained, with predictive quality values within about 1% of the baseline.
dc.language	eng
dc.publisher	Molecular Diversity Preservation International
dc.relation	info:eu-repo/semantics/altIdentifier/url/https://www.mdpi.com/2079-9292/10/15/1757
dc.relation	info:eu-repo/semantics/altIdentifier/doi/http://dx.doi.org/10.3390/electronics10151757
dc.rights	https://creativecommons.org/licenses/by/2.5/ar/
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	APACHE SPARK
dc.subject	BIG DATA
dc.subject	CLASSIFICATION
dc.subject	DATA REDUCTION
dc.subject	PREPROCESSING TECHNIQUES
dc.title	FDR2-BD: A fast data reduction recommendation tool for tabular big data classification problems
dc.type	info:eu-repo/semantics/article
dc.type	info:ar-repo/semantics/artículo
dc.type	info:eu-repo/semantics/publishedVersion

This item belongs to the following institution

Consejo Nacional de Investigaciones Científicas y Tecnológicas (Argentina)