Master's Dissertation
Verificação de unicidade de URLs em coletores de páginas Web (Verifying URL uniqueness in web crawlers)
Date
2011-03-10
Author
Wallace Favoreto Henrique
Institution
Abstract
One of the main difficulties in developing a web crawler lies in the component that verifies URL uniqueness, since complex data structures are required to identify URLs not yet collected both effectively and efficiently. If this component is not effective and efficient, the performance of the other web crawler components is affected. In this work we present a new algorithm for verifying URL uniqueness, referred to as VEUNI (VErificador de UNIcidade de URLs). VEUNI was compared with the best-known algorithm in the literature, which served as the baseline in the experiments. The comparison between VEUNI and the baseline was carried out by simulating a crawl of approximately 350 million pages, using a reference collection called ClueWeb09. Experimental results show that the proposed algorithm is an alternative that can be used successfully in web crawlers designed to scale to the entire Web.
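To make the task concrete, the following is a minimal sketch of what a URL uniqueness check does. It is neither VEUNI nor the baseline algorithm, whose details are not given in this abstract. The class name `SeenUrls`, the method `add_if_new`, the 8-byte SHA-1 fingerprint, and the example URLs are all assumptions chosen for illustration.

```python
import hashlib


class SeenUrls:
    """Illustrative in-memory URL-seen test (not VEUNI, not the baseline)."""

    def __init__(self):
        # Fingerprints of every URL recorded so far.
        self._fingerprints = set()

    @staticmethod
    def fingerprint(url: str) -> bytes:
        # Hypothetical choice: the first 8 bytes of SHA-1 as a compact key.
        return hashlib.sha1(url.encode("utf-8")).digest()[:8]

    def add_if_new(self, url: str) -> bool:
        """Record the URL and return True only if it was not seen before."""
        fp = self.fingerprint(url)
        if fp in self._fingerprints:
            return False
        self._fingerprints.add(fp)
        return True


seen = SeenUrls()
extracted = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
# Keep only the URLs that still need to be collected, preserving order.
to_collect = [u for u in extracted if seen.add_if_new(u)]
```

A set held entirely in memory like this one is unlikely to fit a crawl of the size mentioned above, which is one reason the component calls for more elaborate data structures.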