Master's Dissertation
Uma avaliação da utilização de matrizes de afinidades na validação de agrupamentos de dados
Date
2013-10-25
Author
Rafael Xavier Valente
Institution
Abstract
Unlike a supervised machine learning problem, where one seeks an approximate function from a labeled dataset, an unsupervised problem contains no information to guide the learning process. In this case, a criterion must be adopted to establish the partitions. The problem with this approach is that the commonly used objective functions degenerate with the number of groups, so simply optimizing the adopted criterion cannot provide the optimal number of groups for a given dataset. Therefore, partitions are computed for different numbers of groups and then compared according to another metric in order to select the optimal number of groups. In this work, a new metric is proposed to identify the number of groups in datasets that can be partitioned into compact clusters. To this end, the fuzzy partition matrix obtained from an algorithm such as Fuzzy C-Means (FCM) is used to calculate a proximity matrix between the objects. Several factors are then calculated from the proximity matrix to compose the final index, which is used to compare the partitions and select the one that best agrees with the proposed metric. In addition, the proximity matrix allows the end user to visualize the clusters in two dimensions and so validate the obtained results. To demonstrate the validity of the proposed metric, experiments with synthetic and real datasets are presented. The results obtained for the controlled cases, where the functions that generated the datasets are known, show the validity of the developed metric. For the real datasets, the results are compared with those of other metrics. In this case, the results show that the new approach is consistent with other well-known metrics. In these cases, the proximity matrix is essential for visualizing the partitions and consequently validating them against the intrinsic structure of the datasets.
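
The step from a fuzzy partition matrix to a proximity matrix between objects can be illustrated with a minimal sketch. The abstract does not state the exact proximity formula or the factors that make up the final index, so everything below is an assumption made for illustration: the function names `fuzzy_c_means` and `proximity_matrix` are hypothetical, the fuzzifier is fixed at m = 2, and the proximity between objects i and j is taken as the sum over clusters of min(u_ki, u_kj), a common way to derive a proximity relation from fuzzy memberships. The dissertation may use a different definition.

```python
import random


def fuzzy_c_means(points, c, m=2.0, iters=200, seed=0):
    """Plain Fuzzy C-Means (illustrative sketch, not the author's code).

    Returns (centers, U), where U[k][i] is the membership of point i
    in cluster k and each point's memberships sum to 1.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, normalized per point.
    U = [[rng.random() for _ in range(n)] for _ in range(c)]
    for i in range(n):
        s = sum(U[k][i] for k in range(c))
        for k in range(c):
            U[k][i] /= s

    centers = []
    for _ in range(iters):
        # Centers: means of the points weighted by u^m.
        centers = []
        for k in range(c):
            w = [U[k][i] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total
                 for d in range(dim)]
            )
        # Memberships: inversely related to squared distance to each center.
        for i in range(n):
            d2 = [
                max(sum((points[i][d] - centers[k][d]) ** 2
                        for d in range(dim)), 1e-12)
                for k in range(c)
            ]
            inv = [dk ** (-1.0 / (m - 1.0)) for dk in d2]
            s = sum(inv)
            for k in range(c):
                U[k][i] = inv[k] / s
    return centers, U


def proximity_matrix(U):
    """Assumed proximity: P[i][j] = sum over clusters of min(U[k][i], U[k][j])."""
    c, n = len(U), len(U[0])
    return [
        [sum(min(U[k][i], U[k][j]) for k in range(c)) for j in range(n)]
        for i in range(n)
    ]


# Two compact, well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, U = fuzzy_c_means(points, c=2)
P = proximity_matrix(U)
for row in P:
    print(" ".join(f"{v:.2f}" for v in row))
```

Under this assumed definition, the matrix is symmetric with ones on the diagonal, and objects that share a cluster receive values close to 1 while objects in different clusters receive values close to 0. Rendering such a matrix as a two-dimensional image, with the objects ordered by cluster, is one way to obtain the kind of visual check the abstract describes.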