A new document ranking model based on correlation among terms

Bruno Augusto Vivas e Possas

dc.contributor	Nivio Ziviani
dc.contributor	Wagner Meira Junior
dc.contributor	Edleno Silva de Moura
dc.contributor	Ricardo Baeza-Yates
dc.contributor	Berthier Ribeiro de Araujo Neto
dc.contributor	Imre Simon
dc.creator	Bruno Augusto Vivas e Possas
dc.date.accessioned	2019-08-09T15:06:53Z
dc.date.accessioned	2022-10-03T22:25:05Z
dc.date.available	2019-08-09T15:06:53Z
dc.date.available	2022-10-03T22:25:05Z
dc.date.created	2019-08-09T15:06:53Z
dc.date.issued	2005-08-22
dc.identifier	http://hdl.handle.net/1843/RVMR-6HKGAL
dc.identifier.uri	http://repositorioslatinoamericanos.uchile.cl/handle/2250/3801832
dc.description.abstract	This work presents a new approach for ranking documents in the vector space model. The novelty lies on two fronts. First, patterns of term co-occurrence are taken into account and are processed efficiently. Second, term weights are generated using a data mining technique called association rules. This leads to a new ranking mechanism called the set-based vector model. The components of our model are no longer index terms but index termsets, where a termset is a set of index terms. Termsets capture the intuition that semantically related terms appear close to each other in a document. They can be efficiently obtained by limiting the computation to small passages of text. Once termsets have been computed, the ranking is calculated as a function of the termset frequency in the document and its scarcity in the document collection. The application of our approach provides a simple, effective, efficient and parameterized way to process disjunctive, conjunctive, and phrase queries, as well as automatically structured complex queries. All known approaches that account for correlation among index terms were initially designed for processing only disjunctive queries. Experimental results show that the set-based vector model improves average precision for all collections and query types evaluated, while keeping computational costs small. For the 2-gigabyte TREC-8 collection, the set-based vector model leads to gains in average precision of 14.7% and 16.4% for disjunctive and conjunctive queries, respectively, with respect to the standard vector space model. These gains increase to 24.9% and 30.0%, respectively, when proximity information is taken into account. Query processing times are longer but, on average, still comparable to those obtained with the standard vector model (increases in processing time varied from 30% to 300%). The experimental results also show that the set-based model can be successfully used for automatically structuring queries. For instance, using the TREC-8 test collection, our technique led to gains in average precision of roughly 28% with regard to a BM25 ranking formula. Our results suggest that the set-based vector model provides a correlation-based ranking formula that is effective with general collections and computationally practical.
dc.publisher	Universidade Federal de Minas Gerais
dc.publisher	UFMG
dc.rights	Open Access
dc.subject	Information retrieval
dc.subject	Document ranking
dc.title	A new document ranking model based on correlation among terms
dc.type	Doctoral Thesis
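
The abstract describes ranking as a function of termset frequency in a document and termset scarcity in the collection. The sketch below is only a rough illustration of that idea, not the formula used in the thesis, which this record does not give. The function names, the `max_size` cap, the use of the minimum member-term count as the termset frequency, and the logarithmic weights are all assumptions made for the example; proximity constraints and document-length normalization are left out.

```python
import math
from itertools import combinations


def termsets(query_terms, max_size=2):
    """Yield every non-empty subset of the query terms up to max_size.

    max_size is a hypothetical cap used here to keep the example small.
    """
    terms = sorted(set(query_terms))
    for k in range(1, max_size + 1):
        for combo in combinations(terms, k):
            yield frozenset(combo)


def score(query_terms, docs, max_size=2):
    """Score tokenized documents by summing tf-idf-style termset weights.

    Assumption: a termset occurs in a document when all of its terms do,
    and its frequency is the smallest count among its member terms.
    """
    counts = [{t: doc.count(t) for t in set(doc)} for doc in docs]
    n = len(docs)
    scores = [0.0] * n
    for ts in termsets(query_terms, max_size):
        freqs = [min(c.get(t, 0) for t in ts) for c in counts]
        df = sum(1 for f in freqs if f > 0)  # documents containing the termset
        if df == 0:
            continue
        idf = math.log(1 + n / df)  # scarcity of the termset in the collection
        for i, f in enumerate(freqs):
            if f > 0:
                scores[i] += (1 + math.log(f)) * idf
    return scores


docs = [
    ["set", "based", "vector", "model"],
    ["vector", "space"],
    ["set", "theory"],
]
scores = score(["set", "vector"], docs)
```

In this toy collection the first document contains both query terms, so it also receives the weight of the two-term termset and ranks above the other two, which each match only a single term.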

This item belongs to the following institution

Universidade Federal de Minas Gerais (Brazil)