Generación de instancias sintéticas para clases desbalanceadas (Generation of Synthetic Instances for Imbalanced Classes)

ATLANTIDA IRENE SANCHEZ VIVAR

info:eu-repo/semantics/masterThesis

Record at:

http://inaoe.repositorioinstitucional.mx/jspui/handle/1009/565

https://repositorioslatinoamericanos.uchile.cl/handle/2250/7805783

Author

ATLANTIDA IRENE SANCHEZ VIVAR

Institution

Instituto Nacional de Astrofísica, Óptica y Electrónica (México)

Abstract

One of the main difficulties in machine learning classification tasks is the problem of imbalanced classes. In some data sets, some classes have many more examples than others, so the classifier tends to learn more from them and to ignore the small classes, whose correct classification is generally of greater interest. For the purposes of this thesis, only two-class problems are assumed (minority and majority). Many techniques have been proposed to address this problem, from methods that modify existing algorithms and methods that create new algorithms to methods that change the distribution of the data through resampling, all with the goal of favoring the classification of the minority class. This thesis focuses on resampling methods, specifically on the oversampling of instances, a technique that changes the distribution of the data by adding more instances of the minority class and that has obtained satisfactory results. In this work, two new instance oversampling methods are proposed: GIS-G and GIS-GF. Both methods start from the idea of creating groups within the minority class and generating the new synthetic instances inside each group, rather than globally as traditional methods do. In addition, a different way of assigning nominal values to the new instances is proposed, whereas traditional methods focus only on the assignment of numeric values. The first method, GIS-G, generates new examples by interpolating the numeric values of pairs of instances inside a group. The second method, GIS-GF, generates the numeric attribute values of the new instance from a single seed instance, making use of the standard deviation of the values inside the group. To test the proposed methods, twenty synthetic databases and twenty-three databases taken from real domains were used. The four main oversampling methods (ROS, SMOTE, Borderline-SMOTE1, and Borderline-SMOTE2) were applied in addition to the two methods proposed in this thesis. Ten-fold cross-validation was used on each database. Six different classifiers (AdaBoost M1, Naive Bayes, K-NN, C4.5, PART, and Backpropagation) were tested, and the full process was repeated ten times to obtain the averages of the results. ANOVA and t-tests showed that the proposed methods obtain, on average, better results on the databases used than the other methods. These results are statistically significant.
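
The two numeric generation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The abstract does not state how the minority-class groups are built, how the interpolation weight is drawn, or exactly how the standard deviation is applied, so the sketch assumes that a group is already given, that the weight is uniform in [0, 1), and that the seed is shifted uniformly by at most one per-attribute standard deviation. All function names are hypothetical, and nominal attributes are left out.

```python
import random
import statistics


def interpolate_pair(a, b, rng):
    """Pair-based step (in the spirit of GIS-G): return a point on the
    segment between two instances of the same group."""
    t = rng.random()  # interpolation weight in [0, 1); uniform is an assumption
    return [x + t * (y - x) for x, y in zip(a, b)]


def perturb_seed(seed_instance, group, rng):
    """Seed-based step (in the spirit of GIS-GF): shift a single seed by at
    most one per-attribute standard deviation of its group (assumption)."""
    columns = list(zip(*group))
    stdevs = [statistics.pstdev(col) for col in columns]
    return [x + rng.uniform(-s, s) for x, s in zip(seed_instance, stdevs)]


def oversample_group(group, n_new, method, seed=0):
    """Generate n_new synthetic numeric instances from one minority group.

    method is "pair" (interpolation) or "seed" (single-seed perturbation).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        if method == "pair":
            a, b = rng.sample(group, 2)  # needs at least two instances
            synthetic.append(interpolate_pair(a, b, rng))
        else:
            synthetic.append(perturb_seed(rng.choice(group), group, rng))
    return synthetic


# Hypothetical minority-class group with two numeric attributes.
group = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]]
print(oversample_group(group, 3, "pair"))
print(oversample_group(group, 3, "seed"))
```

Under these assumptions, the pair-based step always stays inside the range spanned by the group, while the seed-based step can place instances slightly outside it, by no more than one standard deviation per attribute.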

One of the main difficulties that arise in a classification task in machine learning is the problem of imbalanced classes. This problem refers to the fact that, in some data sets, some classes have many more examples than others, which causes the classifier to tend to learn more from them and to ignore the small ones, whose correct classification is generally of greatest interest. For the purposes of this thesis, problems with only two classes are assumed: minority and majority. Several techniques have been proposed to solve this problem, from those that modify existing algorithms and those that create new algorithms to those that change the distribution of the data through resampling, all of them with the aim of favoring the classification of the minority class. This thesis focuses on resampling methods, specifically on the oversampling of instances, a technique that changes the distribution of the data by adding more instances of the minority class and that has obtained satisfactory results. This work proposes two new instance oversampling methods, GIS-G and GIS-GF. Both methods start from the idea of creating groups within the minority class and generating the synthetic instances inside each group, and not globally as traditional methods do. In addition, the work proposes a different way of assigning nominal values to the new instances, whereas traditional methods focus only on the assignment of numeric values. The first method, called GIS-G, generates new examples by interpolating the numeric values of pairs of instances within a group. The second method, called GIS-GF, generates the values of the numeric attributes of the new instance with only one instance as seed, making use of the standard deviation of the values within the group. To test the proposed methods, twenty synthetic databases and twenty-three databases taken from real domains were used. The four main oversampling methods (ROS, SMOTE, Borderline-SMOTE1, and Borderline-SMOTE2) were applied in addition to the two methods proposed in this thesis. Ten-fold cross-validation was used on each database, six different classifiers (AdaBoost M1, Naive Bayes, K-NN, C4.5, PART, and Backpropagation) were tested, and the complete process was repeated ten times to finally obtain the averages of the results. It was shown, by means of ANOVA and tests