Thesis
Aprofundamento da caracterização linguístico-computacional da complementaridade em um corpus jornalístico multidocumento
Date
2019-02-27
Author
Souza, Jackson Wilke da Cruz
Institution
Abstract
In the context of the dissemination of digital information, Cisco, a web security company, projects that 3.3 zettabytes of information will circulate on the Web in 2021. In this scenario, sub-areas of Natural Language Processing (NLP) develop linguistic-computational solutions that help users, who have little time, cope with the volume of information circulating on the Web. One of these sub-areas is Automatic Multi-document Summarization (AMS), which aims to create automatic summaries from collections of source texts that deal with the same subject. To support content selection for automatic summaries and to improve this technique, some studies rely on linguistic descriptions of multi-document phenomena. One of these phenomena is complementarity, which occurs when, in a sentence pair (S1, S2), S2 elaborates on some information present in S1. The theoretical model Cross-Document Structure Theory (CST) expresses complementarity as three semantic relations: Historical Background and Follow-up (temporal) and Elaboration (atemporal). Some studies in this area indicate that (superficial) temporal linguistic attributes are relevant for automatically identifying these CST relations, yielding automatic classifiers with 75% accuracy. Thus, under the hypothesis that deep linguistic information could generate more effective classifiers, we propose a refined set of attributes that characterize complementarity. After manually analyzing the sentence pairs annotated with the CST complementarity relations in the CSTNews corpus, we built a typology of 32 signs, organized in seven categories, including anaphora, textual structure, morphology, syntax, semantics, and pragmatics. Using symbolic Machine Learning algorithms, we constructed and trained new classifiers whose accuracy surpassed the state of the art.
Thus, we contribute to (i) Descriptive Linguistics, with a typology organized in signs that systematically presents the evidence and characteristics of complementarity in sentence pairs from journalistic texts, and to (ii) NLP, with a more refined and specific description for the automatic identification of complementarity and, consequently, for the selection of content for automatic multi-document summaries.
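As a rough illustration of how surface temporal cues might separate the three CST complementarity relations, the following minimal sketch applies hand-written symbolic rules to the second sentence of a pair. It is not the classifier built in the thesis: the cue lists, the function names (`classify_pair`, `_has_cue`), and the rule order are all hypothetical, and English cues are used only for readability.

```python
import re

# Hypothetical cue lists, invented for illustration only.
# They are NOT the 32-sign typology described in the abstract.
PAST_CUES = ("previously", "earlier", "last year", "in the past")
LATER_CUES = ("later", "afterwards", "subsequently", "since then")


def _has_cue(text: str, cues: tuple) -> bool:
    """Return True if any cue occurs in text as a whole word or phrase."""
    pattern = r"\b(?:" + "|".join(re.escape(cue) for cue in cues) + r")\b"
    return re.search(pattern, text.lower()) is not None


def classify_pair(s1: str, s2: str) -> str:
    """Assign a CST complementarity relation to the pair (S1, S2).

    Assumption: only S2 is inspected in this sketch; S1 is kept in the
    signature because the relation is defined over the sentence pair.
    """
    past = _has_cue(s2, PAST_CUES)
    later = _has_cue(s2, LATER_CUES)
    if past and not later:
        return "Historical Background"  # S2 reports something prior to S1
    if later and not past:
        return "Follow-up"  # S2 reports something after S1
    return "Elaboration"  # no clear temporal signal


if __name__ == "__main__":
    s1 = "The plane crashed on Monday."
    print(classify_pair(s1, "The airline had previously reported engine faults."))
    print(classify_pair(s1, "Rescue teams later found the flight recorder."))
    print(classify_pair(s1, "The aircraft was carrying 120 passengers."))
```

A learned symbolic model, such as a decision tree or rule inducer, would derive rules of this kind from annotated pairs rather than from a hand-written list, and would presumably draw on deeper attributes than the two surface cues assumed here.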