Automatic clustering of online newspaper news using machine learning techniques for clustering Portuguese-language texts

Lúcia Helena de Magalhães

Thesis

Date

2020-02-13

Record at:

http://hdl.handle.net/1843/37525

http://repositorioslatinoamericanos.uchile.cl/handle/2250/3815283

Author

Lúcia Helena de Magalhães

Institution

Universidade Federal de Minas Gerais (Brazil)

Abstract

Clustering is the technique of organizing data into groups whose members are similar to one another. This research uses techniques from Text Mining, Natural Language Processing, Machine Learning and Clustering to create groups of similar reports from a sample retrieved from online newspapers, given that few studies address the clustering of news published in Portuguese. The lack of research in this area reinforces the scarcity of information, which hinders the development of automated solutions capable of retrieving articles published in Portuguese-language media, comparing them, and grouping them by similarity. This study therefore aims to use an unsupervised learning methodology capable of automatically grouping news published in Brazilian Portuguese in the mainstream media. It also seeks to identify the main methods used in the text clustering process; to apply these techniques to a collection of news published in Portuguese and verify the performance of the clustering algorithms when fed a corpus of texts; to apply the methodology to different corpora and discuss the success of the technique in each case; and to investigate whether document clustering is effectively possible and analyze the difficulties encountered for different samples. To that end, the concepts and areas related to the theme are presented, together with a bibliographic review of related work, the proposed methodology, and experiments that allow certain arguments to be developed and some hypotheses to be proven. For the experiments, the news items were first collected and then pre-processed, a stage in which stop words were removed and tokenization and stemming were applied. With the corpus prepared, the main features of the texts were extracted and the documents were represented in a vector space model.
The similarity between the documents was then calculated, the clustering technique was applied immediately afterwards, and the groups were formed. For better visualization, validation and interpretation of the results, the clusters were presented in dendrograms and scatter plots. The main conclusions indicate that the pre-processing stage requires special effort to guarantee data quality. The complexity of the Portuguese language, the need to update the stop-word list, the detection of the most important features and, more generally, the problems related to the high dimensionality of the data were all evident throughout the study. Distance measures also played an important role in the clustering analysis, but no single measure best suits all clustering problems. The k-means algorithm obtained the best results for this type of information, whereas Hierarchical Clustering had difficulties with larger corpora, since similar documents were allocated to different groups. The Affinity Propagation algorithm, in turn, diverged as to the ideal number of clusters but performed well when grouping by similarity.
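
The pipeline described in the abstract (stop-word removal, tokenization, vector space representation, similarity calculation) can be illustrated with a minimal sketch. This is not the thesis's implementation. The stop-word list, the three sample headlines, the smoothed TF-IDF weighting and the choice of cosine similarity are all assumptions made for illustration, and stemming is omitted to keep the example dependency-free (a Portuguese stemmer such as NLTK's `RSLPStemmer` could be applied to each token).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list is much longer
# and, as the abstract notes, needs to be kept up to date).
STOP_WORDS = {"a", "o", "as", "os", "de", "do", "da", "em", "e",
              "que", "para", "com", "no", "na", "um", "uma"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Represent each document as a sparse {term: weight} TF-IDF vector."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (one common variant, chosen here as an assumption).
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample headlines, not taken from the thesis corpus.
docs = [
    "O governo anunciou novas medidas para a economia",
    "Novas medidas do governo afetam a economia do país",
    "O time venceu o campeonato de futebol",
]

vectors = tfidf_vectors(docs)
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(f"similarity(doc {i}, doc {j}) = {cosine(vectors[i], vectors[j]):.3f}")
```

In this sketch the two economy headlines share several terms and therefore score higher than either does against the football headline, which shares none. A pairwise similarity (or distance) matrix of this kind is the usual input to clustering algorithms such as those named in the abstract.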

Subjects

News clustering

Text mining

Machine learning

Natural language processing
