Master's thesis
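A method for feature selection in hierarchical multi-label classification datasets (original title: Um método para seleção de atributos em bases de dados de classificação hierárquica multirrótulo)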
Date
2022-07-07
Citation:
VIEIRA, Raimundo Osvaldo. Um método para seleção de atributos em bases de dados de classificação hierárquica multirrótulo. 2022. Dissertação (Mestrado em Ciência da Computação) - Universidade Tecnológica Federal do Paraná, Ponta Grossa, 2022.
Author
Vieira, Raimundo Osvaldo
Abstract
Hierarchical multi-label classification problems often involve datasets with a large number of attributes and labels, which can degrade classifier performance. Dimensionality reduction methods can significantly improve that performance. Feature selection is one such method: it chooses the most relevant attributes from the original set. There are three main approaches to feature selection: filter, wrapper, and embedded. The filter approach selects attributes based only on characteristics of the data itself, independently of the training algorithm. Some feature selection methods have been proposed for hierarchical multi-label classification. These methods build on techniques that are well established in flat and single-label classification, and they have shown good results. In this context, this work investigated the applicability of the Fisher Score measure to feature selection in hierarchical multi-label classification scenarios and proposed a filter method for this task. The proposed method, FSF-HMC, evaluates each attribute individually by computing its Fisher Score, with the calculation adapted to take the class hierarchy into account. Attributes whose score is above the average Fisher Score over all attributes are selected to form the reduced dataset used to evaluate the classifier. To validate the proposed method, experiments were performed with 10 Gene Ontology datasets. The experiments evaluated two hierarchical multi-label classifiers, Clus-HMC and MHC-CNN, in terms of the AUPRC measure, comparing the results obtained on the original datasets with those obtained on the reduced datasets.
The results of the experiments show that the method reduced the number of attributes relative to the original data and that classifier performance was statistically equivalent on the original and reduced datasets.
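The selection rule described in the abstract (score each attribute, then keep those above the average score) can be illustrated with a minimal Python sketch. This is not the FSF-HMC implementation. The abstract does not specify how the Fisher Score is adapted to the class hierarchy, so the sketch assumes a label matrix in which ancestor labels are already set for every positive descendant, and it simply averages per-label binary Fisher Scores. The function names and the toy data are hypothetical.

```python
import numpy as np


def fisher_score_binary(X, y):
    """Fisher Score of each column of X for one binary label vector y."""
    overall_mean = X.mean(axis=0)
    numerator = np.zeros(X.shape[1])
    denominator = np.zeros(X.shape[1])
    for value in (0, 1):
        group = X[y == value]
        if len(group) == 0:
            continue
        # Between-class scatter and within-class variance, weighted by group size.
        numerator += len(group) * (group.mean(axis=0) - overall_mean) ** 2
        denominator += len(group) * group.var(axis=0)
    # Features with zero within-class variance get a score of 0
    # to avoid dividing by zero.
    return np.divide(numerator, denominator,
                     out=np.zeros_like(numerator), where=denominator > 0)


def select_above_mean(X, Y):
    """Return indices of features whose aggregated score exceeds the mean score.

    Assumption (not from the thesis): the aggregated score is the plain
    average of the per-label binary Fisher Scores.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y)
    per_label = [fisher_score_binary(X, Y[:, k]) for k in range(Y.shape[1])]
    scores = np.mean(per_label, axis=0)
    selected = np.flatnonzero(scores > scores.mean())
    return selected, scores


# Toy data: feature 0 separates the labels, feature 1 is noise, feature 2 is constant.
X = [[0.0, 1.0, 5.0],
     [0.1, 0.0, 5.0],
     [1.0, 1.0, 5.0],
     [1.1, 0.0, 5.0]]
# Two labels; label 1 is assumed to be a child of label 0 in the hierarchy.
Y = [[0, 0],
     [0, 0],
     [1, 0],
     [1, 1]]

selected, scores = select_above_mean(X, Y)
print(selected, scores)
```

On this toy data only feature 0 scores above the mean, so it is the only attribute kept. A hierarchy-aware variant could, for example, weight each label by its depth, but that would be a further assumption beyond what the abstract states.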