Investigação de métodos de sumarização automática multidocumento baseados em hierarquias conceituais

Zacarias, Andressa Caroline Inácio

dc.contributor	Di Felippo, Ariani
dc.contributor	http://lattes.cnpq.br/8648412103197455
dc.contributor	http://lattes.cnpq.br/4398305062037262
dc.creator	Zacarias, Andressa Caroline Inácio
dc.date.accessioned	2016-10-20T16:19:25Z
dc.date.available	2016-10-20T16:19:25Z
dc.date.created	2016-10-20T16:19:25Z
dc.date.issued	2016-03-29
dc.identifier	ZACARIAS, Andressa Caroline Inácio. Investigação de métodos de sumarização automática multidocumento baseados em hierarquias conceituais. 2016. Dissertação (Mestrado em Linguística) – Universidade Federal de São Carlos, São Carlos, 2016. Disponível em: https://repositorio.ufscar.br/handle/ufscar/7974.
dc.identifier	https://repositorio.ufscar.br/handle/ufscar/7974
dc.description.abstract	Automatic Multi-Document Summarization (MDS) aims to create a single coherent and cohesive summary from a collection of source texts on the same topic. Creating these summaries, which are generally extracts (informative and generic), requires selecting the most important sentences from the collection. To this end, one may use superficial (statistical) linguistic knowledge or deep knowledge. Deep methods, although more expensive and less robust, produce extracts that are more informative and of higher linguistic quality. For Portuguese, the only deep methods that use lexical-conceptual knowledge select content based on the frequency of occurrence of concepts in the collection. Considering the potential of semantic-conceptual knowledge, this work investigates MDS methods that first represent the lexical concepts of the source texts in a hierarchy and then explore hierarchical properties capable of distinguishing the most relevant concepts (that is, the topics of a collection of texts) from the others. Specifically, 3 of the 50 collections of CSTNews (a reference multi-document corpus for Portuguese) were selected, and the nouns occurring in the source texts of each collection were manually indexed to concepts of the Princeton WordNet (WN.Pr), yielding a hierarchy composed of the concepts derived from the collection plus other concepts inherited from WN.Pr to build the hierarchy. The concepts in the hierarchy were characterized by 5 graph (relevance) metrics potentially useful for identifying the concepts that should compose a summary: Centrality, Simple Frequency, Cumulative Frequency, Closeness and Level.
This characterization was analyzed manually and with machine learning (ML) algorithms in order to determine the most suitable measures for identifying the relevant concepts of a collection. As a result, the Centrality measure was discarded and the remaining measures were used to propose content selection methods for MDS. Specifically, 2 sentence selection methods were chosen, which form the basis of the extractive methods: (i) CFSumm, whose content selection is based exclusively on the Simple Frequency metric, and (ii) LCHSumm, whose selection is based on rules learned by machine learning algorithms using all 4 relevant measures as attributes. These methods were evaluated intrinsically: informativeness was measured with the ROUGE package of measures, and linguistic quality was assessed according to the criteria of the TAC conference. For this purpose, the 6 human abstracts available in each CSTNews collection were used. Furthermore, the summaries generated by the proposed methods were compared to the extracts generated by the GistSumm summarizer, taken as the baseline. Both methods obtained satisfactory results compared to the GistSumm baseline, and CFSumm stood out over LCHSumm.
dc.language	por
dc.publisher	Universidade Federal de São Carlos
dc.publisher	UFSCar
dc.publisher	Programa de Pós-Graduação em Linguística - PPGL
dc.publisher	Câmpus São Carlos
dc.rights	Open access
dc.subject	Sumarização Automática Multidocumento
dc.subject	Métricas de grafo
dc.subject	Hierarquia léxico-conceitual
dc.subject	Automatic multi-document summarization
dc.subject	Graph metrics
dc.subject	Lexical-conceptual hierarchy
dc.title	Investigação de métodos de sumarização automática multidocumento baseados em hierarquias conceituais
dc.type	Thesis
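
The abstract describes characterizing hierarchy concepts by metrics such as Simple Frequency, Cumulative Frequency and Level, and selecting sentences by Simple Frequency. The following is only a minimal illustrative sketch of that idea, not the dissertation's implementation. The toy hierarchy, the names `PARENT`, `SENTENCES` and all functions are hypothetical, and the exact definitions used here (Cumulative Frequency as a concept's own count plus its descendants' counts, sentence score as the sum of its concepts' Simple Frequency) are assumptions that the abstract does not spell out.

```python
from collections import Counter

# Hypothetical toy hierarchy: concept -> parent (hypernym); the root has no parent.
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "airplane": "vehicle",
    "car": "vehicle",
    "person": "entity",
    "pilot": "person",
}

# Hypothetical collection: each sentence is the list of concepts its nouns were indexed to.
SENTENCES = [
    ["airplane", "pilot"],
    ["airplane", "car"],
    ["person"],
    ["airplane"],
]


def level(concept):
    """Depth of a concept, counted as the number of edges up to the root."""
    depth = 0
    while PARENT[concept] is not None:
        concept = PARENT[concept]
        depth += 1
    return depth


def simple_frequency(sentences):
    """Number of mentions of each concept in the collection."""
    return Counter(concept for sentence in sentences for concept in sentence)


def cumulative_frequency(simple):
    """Assumed definition: a concept's own count plus the counts of all its descendants."""
    cumulative = Counter()
    for concept, count in simple.items():
        node = concept
        while node is not None:  # propagate the count to every ancestor
            cumulative[node] += count
            node = PARENT[node]
    return cumulative


def select_sentences(sentences, simple, k):
    """Assumed scoring: sum of Simple Frequency over a sentence's concepts.

    Returns the indices of the k best sentences, in document order.
    """
    scores = [sum(simple[c] for c in sentence) for sentence in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return sorted(ranked[:k])


if __name__ == "__main__":
    simple = simple_frequency(SENTENCES)
    print(dict(cumulative_frequency(simple)))
    print(select_sentences(SENTENCES, simple, k=2))
```

In this toy setting, general concepts near the root accumulate the counts of everything below them, which is why a metric such as Level may be useful alongside frequency when deciding which concepts are actual topics.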

This item belongs to the following institution

Universidade Federal de São Carlos (Brasil)