Class imbalance revisited: a new experimental setup to assess the performance of treatment methods

Prati, Ronaldo C.; Batista, Gustavo Enrique de Almeida Prado Alves; Silva, Diego Furtado

dc.creator	Prati, Ronaldo C.
dc.creator	Batista, Gustavo Enrique de Almeida Prado Alves
dc.creator	Silva, Diego Furtado
dc.date.accessioned	2016-10-19T17:06:09Z
dc.date.accessioned	2018-07-04T17:10:01Z
dc.date.available	2016-10-19T17:06:09Z
dc.date.available	2018-07-04T17:10:01Z
dc.date.created	2016-10-19T17:06:09Z
dc.date.issued	2015-10
dc.identifier	Knowledge and Information Systems, London, v. 45, n. 1, p. 247-270, Oct. 2015
dc.identifier	0219-1377
dc.identifier	http://www.producao.usp.br/handle/BDPI/50994
dc.identifier	10.1007/s10115-014-0794-3
dc.identifier	http://dx.doi.org/10.1007/s10115-014-0794-3
dc.identifier.uri	http://repositorioslatinoamericanos.uchile.cl/handle/2250/1645554
dc.description.abstract	In the last decade, class imbalance has attracted a huge amount of attention from researchers and practitioners. Class imbalance is ubiquitous in Machine Learning, Data Mining and Pattern Recognition applications; therefore, these research communities have responded to such interest with literally dozens of methods and techniques. Surprisingly, there are still many fundamental open-ended questions such as “Are all learning paradigms equally affected by class imbalance?”, “What is the expected performance loss for different imbalance degrees?” and “How much of the performance losses can be recovered by the treatment methods?”. In this paper, we propose a simple experimental design to assess the performance of class imbalance treatment methods. This experimental setup uses real data sets with artificially modified class distributions to evaluate classifiers in a wide range of class imbalance. We apply such experimental design in a large-scale experimental evaluation with 22 data sets and seven learning algorithms from different paradigms. We also propose a statistical procedure aimed at evaluating the relative degradation and recoveries, based on confidence intervals. This procedure allows a simple yet insightful visualization of the results, and provides the basis for drawing statistical conclusions. Our results indicate that the expected performance loss, as a percentage of the performance obtained with the balanced distribution, is quite modest (below 5%) for the most balanced distributions up to 10% of minority examples. However, the loss tends to increase quickly for higher degrees of class imbalance, reaching 20% for 1% of minority class examples. Support Vector Machine is the classifier paradigm that is least affected by class imbalance, being almost insensitive to all but the most imbalanced distributions. Finally, we show that the treatment methods only partially recover the performance losses. On average, typically, about 30% or less of the performance that was lost due to class imbalance was recovered by these methods.
dc.language	eng
dc.publisher	Springer
dc.publisher	London
dc.relation	Knowledge and Information Systems
dc.rights	Copyright Springer-Verlag
dc.rights	closedAccess
dc.subject	Class imbalance
dc.subject	Experimental setup
dc.subject	Sampling methods
dc.title	Class imbalance revisited: a new experimental setup to assess the performance of treatment methods
dc.type	Journal articles

This item belongs to the following institution

Universidade de São Paulo (Brazil)