Thesis
Agrupamento automático de notícias de jornais on-line usando técnicas de machine learning para clustering de textos no idioma português
Date
2020-02-13
Author
Lúcia Helena de Magalhães
Institution
Abstract
Clustering is the technique of organizing data into groups whose members are in some way
similar. The purpose of this research is to use the techniques of Text Mining, Natural Language
Processing, Machine Learning and Clustering to create groups of similar news articles from a
sample retrieved from online newspapers, given that few studies address the clustering
of news published in Portuguese. The lack of research in this area ends up
reinforcing the scarcity of information, which hinders the development of automated
solutions capable of retrieving and comparing the articles featured in the media, published in
Portuguese, and grouping them by similarity. Thus, this study aims to use an unsupervised
learning methodology capable of automatically grouping news published in Brazilian
Portuguese by the mainstream media. It also seeks to identify the main methods used in
text clustering; apply these techniques to a collection of news published in Portuguese and
verify the performance of the clustering algorithms when fed a corpus of texts; apply the
methodology to different corpora and discuss the success of the technique in each case;
and investigate the practical feasibility of document clustering and analyze the
difficulties encountered for different
samples. To that end, the concepts and areas related to the theme are presented, along with a
literature review of related work, the proposed methodology, and experiments that allow
certain arguments to be developed and some hypotheses to be verified. For the experiments,
the news articles were first collected and then pre-processed, a stage in which stop words
were removed and tokenization and stemming were applied. With the corpus prepared, the
main features of the texts were extracted
and the documents were represented in a vector space model. The similarity between
documents was then calculated, after which the clustering technique was applied and the
groups were formed. For better visualization, validation and
interpretation of results, the clusters were presented in dendrograms and scatter plots.
The main conclusions of this research indicate that the pre-processing stage requires
special effort to guarantee the quality of the data. The complexity of the Portuguese
language, the need to update the list of stop words, the identification of the most important
features and, in general, the problems related to the high dimensionality of the data
all became evident over the course of this study. Distance
measures also played an important role in the clustering analysis, but no single measure
best suits all clustering problems. The k-means algorithm obtained the best results for this type
of information, whereas Hierarchical Clustering had difficulties with larger corpora, since similar
documents were allocated to different groups. The Affinity Propagation algorithm, on the other
hand, diverged as to the ideal number of clusters, but achieved good performance when
grouping by similarity.
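
The pipeline summarized in the abstract (pre-processing, vector space representation, similarity calculation, clustering) can be illustrated with a minimal sketch. This is not the thesis's implementation: the tiny stop-word list, the crude plural stripping that stands in for a real Portuguese stemmer, the TF-IDF weighting, the farthest-first seeding of k-means, and the four sample headlines are all assumptions made purely for illustration.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (assumption; a real Portuguese list is far longer).
STOP_WORDS = {"a", "o", "as", "os", "de", "do", "da", "em", "e",
              "que", "para", "no", "na", "um", "uma"}

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and strip trailing 's'
    as a crude stand-in for stemming (assumption, not a real stemmer)."""
    tokens = re.findall(r"[a-záàâãéêíóôõúç]+", text.lower())
    return [t.rstrip("s") if len(t) > 3 else t
            for t in tokens if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Represent each tokenized document as a unit-length TF-IDF vector."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        v = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors

def cosine(u, v):
    """Cosine similarity; inputs are unit length, so a dot product suffices."""
    return sum(a * b for a, b in zip(u, v))

def kmeans(vectors, k, iterations=10):
    """Small k-means on unit vectors, assigning by highest cosine similarity."""
    # Deterministic farthest-first seeding (assumption, for reproducibility).
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document joins its most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid becomes the normalized mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                norm = math.sqrt(sum(x * x for x in mean)) or 1.0
                centroids[j] = [x / norm for x in mean]
    return labels

# Hypothetical sample headlines: two about football, two about interest rates.
news = [
    "O time venceu o jogo de futebol no estádio",
    "O jogador marcou dois gols no jogo de futebol",
    "O banco central elevou a taxa de juros",
    "A taxa de juros afeta a inflação e o banco",
]
labels = kmeans(tfidf_vectors([preprocess(t) for t in news]), k=2)
```

On this toy sample the two football headlines end up in one cluster and the two interest-rate headlines in the other. Real news corpora are far noisier, which is where the stop-word maintenance and high-dimensionality issues mentioned in the abstract come into play.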