Aplicação de conhecimento léxico-conceitual na sumarização multidocumento multilíngue

Tosta, Fabricio Elder da Silva

dc.contributor	Di Felippo, Ariani
dc.contributor	http://lattes.cnpq.br/8648412103197455
dc.contributor	http://lattes.cnpq.br/0011930854854466
dc.creator	Tosta, Fabricio Elder da Silva
dc.date.accessioned	2015-03-11
dc.date.accessioned	2016-06-02T20:25:23Z
dc.date.available	2015-03-11
dc.date.available	2016-06-02T20:25:23Z
dc.date.created	2015-03-11
dc.date.created	2016-06-02T20:25:23Z
dc.date.issued	2014-02-27
dc.identifier	TOSTA, Fabricio Elder da Silva. Aplicação de conhecimento léxico-conceitual na sumarização multidocumento multilíngue. 2014. 119 f. Dissertação (Mestrado em Ciências Humanas) - Universidade Federal de São Carlos, São Carlos, 2014.
dc.identifier	https://repositorio.ufscar.br/handle/ufscar/5796
dc.description.abstract	Traditionally, Multilingual Multi-document Automatic Summarization (MMAS) is a computational application that, from a single collection of source texts on the same subject/topic in at least two languages, produces an informative and generic summary (extract) in one of these languages. The simplest methods automatically translate the source texts and then, over the resulting monolingual collection, apply content selection strategies based on shallow and/or deep linguistic knowledge. MMAS applications therefore need to identify the main information of the collection while avoiding redundancy, and must also handle the problems introduced by machine translation (MT) of the full source texts. Looking for alternatives to this traditional MMAS scenario, we investigated two methods (Method 1 and Method 2) that, being based on deep linguistic knowledge at the lexical-conceptual level, avoid full MT of the source texts while generating informative and cohesive/coherent summaries. In these methods, content selection starts with the scoring and ranking of the original sentences based on the frequency of occurrence in the collection of the concepts expressed by their common nouns. In Method 1, only the highest-scoring, non-redundant sentences in the user's language are selected to compose the extract, until the compression rate is reached. In Method 2, the best-ranked, non-redundant original sentences are selected for the summary without privileging the user's language; when selected sentences are not in the user's language, they are automatically translated. To produce automatic summaries according to Methods 1 and 2 and subsequently evaluate them, the CM2News corpus was built. The corpus has 20 collections of news texts, each containing 1 original text in English and 1 original text in Portuguese, both on the same topic.
The common nouns of CM2News were identified through morphosyntactic annotation, and the corpus was then semi-automatically annotated with Princeton WordNet concepts using the Mulsen graphical editor, which was developed specifically for the task. For the production of extracts according to Method 1, only the best-ranked sentences in Portuguese were selected until the compression rate was reached. For the production of extracts according to Method 2, the best-ranked sentences were selected without privileging the user's language. If English sentences were selected, they were automatically translated into Portuguese by the Bing translator. Methods 1 and 2 were evaluated intrinsically with respect to the linguistic quality and informativeness of the summaries. To evaluate linguistic quality, 15 computational linguists manually analyzed the grammaticality, non-redundancy, referential clarity, focus, and structure/coherence of the summaries; to evaluate informativeness, the summaries were automatically compared to reference summaries using ROUGE measures. In both evaluations, the results showed better performance for Method 1, which might be explained by the fact that its sentences were selected from a single source text. Furthermore, we highlight that both methods based on lexical-conceptual knowledge performed better than simpler MMAS methods that adopt full MT of the source texts. Finally, besides the promising results on the application of lexical-conceptual knowledge, this work has produced important resources and tools for MMAS, such as the CM2News corpus and the Mulsen editor.
dc.publisher	Universidade Federal de São Carlos
dc.publisher	BR
dc.publisher	UFSCar
dc.publisher	Programa de Pós-Graduação em Linguística - PPGL
dc.rights	Open Access
dc.subject	Linguística
dc.subject	Sumarização automática
dc.subject	Sumarização multidocumento multilíngue
dc.subject	Conhecimento léxico-conceitual
dc.subject	Estratégias de seleção de conteúdo
dc.subject	Multilingual multi-document automatic summarization
dc.subject	Lexical-conceptual knowledge
dc.subject	Content selection
dc.title	Aplicação de conhecimento léxico-conceitual na sumarização multidocumento multilíngue
dc.type	Thesis

This item belongs to the following institution

Universidade Federal de São Carlos (Brasil)
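
The content selection procedure summarized in the abstract (score sentences by the collection frequency of their concepts, rank them, then pick non-redundant sentences until a size limit is reached) can be illustrated with a minimal Python sketch. It is not the thesis implementation. Several details are assumptions made only for illustration: each sentence is a dict with hypothetical keys `lang`, `text`, and `concepts`; a sentence's score is the sum of its concepts' frequencies; redundancy is approximated by Jaccard overlap of concept sets with an arbitrary threshold; the compression rate is represented as a plain word budget (`max_words`); and `translate` is a placeholder callable standing in for an MT system.

```python
from collections import Counter


def score_sentences(sentences):
    """Score each sentence by the collection frequency of its concepts.

    Assumption: the score is the sum of the frequencies of the concepts
    annotated on the sentence; the abstract does not give the formula.
    """
    freq = Counter(c for s in sentences for c in s["concepts"])
    return [sum(freq[c] for c in s["concepts"]) for s in sentences]


def is_redundant(candidate, chosen, threshold=0.5):
    """Assumed redundancy test: Jaccard overlap of concept sets."""
    a = set(candidate["concepts"])
    for s in chosen:
        b = set(s["concepts"])
        union = a | b
        if union and len(a & b) / len(union) >= threshold:
            return True
    return False


def select_sentences(sentences, user_lang, max_words,
                     only_user_lang, translate=None):
    """Pick best-ranked, non-redundant sentences within a word budget.

    only_user_lang=True  mirrors Method 1 (user's language only).
    only_user_lang=False mirrors Method 2 (any language; sentences in
    another language are passed through the `translate` callable).
    """
    scores = score_sentences(sentences)
    # Rank by descending score; ties keep the original sentence order.
    order = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    chosen, used = [], 0
    for i in order:
        s = sentences[i]
        if only_user_lang and s["lang"] != user_lang:
            continue
        if is_redundant(s, chosen):
            continue
        # Simplification: the budget counts words of the original text,
        # even when the sentence will be translated afterwards.
        n_words = len(s["text"].split())
        if used + n_words > max_words:
            continue
        chosen.append(s)
        used += n_words
    summary = []
    for s in chosen:
        if s["lang"] != user_lang and translate is not None:
            summary.append(translate(s["text"]))
        else:
            summary.append(s["text"])
    return summary


# Toy collection with invented sentences and concept labels.
collection = [
    {"lang": "pt", "text": "O terremoto atingiu a cidade.",
     "concepts": ["earthquake", "city"]},
    {"lang": "en", "text": "The earthquake hit the city and the port.",
     "concepts": ["earthquake", "city", "port"]},
    {"lang": "pt", "text": "O governo enviou ajuda.",
     "concepts": ["government", "aid"]},
    {"lang": "en", "text": "The government sent aid to the port.",
     "concepts": ["government", "aid", "port"]},
]

method1 = select_sentences(collection, "pt", 20, only_user_lang=True)
method2 = select_sentences(collection, "pt", 20, only_user_lang=False,
                           translate=lambda t: "[pt] " + t)
```

On this toy input, the Method 1 variant keeps only the two Portuguese sentences, while the Method 2 variant keeps the two higher-scoring English sentences and sends them through the placeholder translator. Summing concept frequencies favors longer sentences, so a real system might normalize the score; that choice is left open here.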