México
Doctoral Thesis
Nuevos Algoritmos Basados en Grafos y Clustering para el Tratamiento de Complejidades de los Datos
New Algorithms Based on Graphs and Clustering for Handling Data Complexities
Author
Gúzman Ponce, Angélica; 702275
Institution
Abstract
Nowadays, knowledge extraction from data is an essential task for decision-making in many areas. However, data sets commonly present intrinsic problems (complexities) that degrade the knowledge extraction process. An imbalanced distribution of data between classes and the presence of noise and/or class overlap frequently reduce performance, because the data are assumed to be uniformly distributed among classes and free from any other problem. All these issues have been studied in Pattern Recognition and Data Mining because of their impact on the performance of learning models. This Ph.D. thesis therefore addresses class imbalance, class overlap, and/or noise through techniques that reduce and clean the most represented class. Among the solutions for handling the class imbalance problem, new graph-based algorithms are proposed. This idea arises from the fact that many real-world problems (network analysis, chemical models, and remote sensing, among others) have been tackled with graph-based strategies, in which the problem is expressed in terms of vertices and edges. With this in mind, the proposals presented in this thesis model the most represented class as a complete graph, so that a representative subset of majority class instances is obtained through reduction criteria. For data sets with class imbalance and class overlap and/or noise, the proposals use clustering algorithms as a cleaning strategy. These algorithms are well known for grouping instances according to similar characteristics; the proposal presented here instead exploits their ability to detect noisy instances. The clustering algorithm is therefore applied before the class imbalance is addressed. As a further extension, and due to the growing interest in Big Data problems, the last part of this report introduces a graph-based algorithm to handle class imbalance in large-scale data sets.
Becas Nacionales del CONACYT (CONACYT national scholarships)
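The two-stage idea described in the abstract (clean the majority class first, then reduce it through a complete graph) can be illustrated with a rough sketch. This is not the thesis's algorithm: the abstract does not state the reduction criteria or the clustering method, so the density filter `drop_isolated` (a simplified stand-in for clustering-based noise detection), the greedy shortest-edge pruning in `graph_undersample`, and the parameters `eps`, `min_neighbors`, and `target_size` are all assumptions made purely for illustration.

```python
import math
from itertools import combinations


def drop_isolated(points, eps, min_neighbors):
    """Assumed cleaning step: keep only points that have at least
    `min_neighbors` other points within distance `eps`."""
    kept = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= eps
        )
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept


def graph_undersample(majority, target_size):
    """Assumed reduction criterion on the complete graph of majority
    instances: visit edges from shortest to longest and drop one endpoint
    of each edge whose endpoints are both still present, until only
    `target_size` vertices remain."""
    alive = set(range(len(majority)))
    # Every pair of instances is an edge, weighted by Euclidean distance.
    edges = sorted(
        combinations(range(len(majority)), 2),
        key=lambda e: math.dist(majority[e[0]], majority[e[1]]),
    )
    for u, v in edges:
        if len(alive) <= target_size:
            break
        if u in alive and v in alive:
            alive.discard(v)  # the closest pair is treated as redundant
    return [majority[i] for i in sorted(alive)]


# Toy data (hypothetical): six majority instances, two minority instances.
majority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
            (10.0, 10.0), (11.0, 10.0), (50.0, 50.0)]
minority = [(30.0, 30.0), (32.0, 31.0)]

cleaned = drop_isolated(majority, eps=5.0, min_neighbors=1)
balanced_majority = graph_undersample(cleaned, target_size=len(minority))
print(balanced_majority)
```

In this toy run the isolated instance at (50, 50) is discarded by the cleaning step, and the pruning keeps one representative from each of the two dense groups. Building the complete graph costs quadratic time and memory in the number of majority instances, which is one reason a large-scale setting would need a different strategy.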