info:eu-repo/semantics/article
Decoding the structure of the WWW: a comparative analysis of web crawls
Date
2007-08
Record in:
Serrano, Maria Angeles; Maguitman, Ana Gabriela; Boguña, Marian; Fortunato, Santo; Vespignani, Alessandro; Decoding the structure of the WWW: a comparative analysis of web crawls; Association for Computing Machinery; ACM Transactions on the Web; 1; 2; 8-2007; 1131-1155
1559-1131
CONICET Digital
CONICET
Authors
Serrano, Maria Angeles
Maguitman, Ana Gabriela
Boguña, Marian
Fortunato, Santo
Vespignani, Alessandro
Abstract
The understanding of the immense and intricate topological structure of the World Wide Web (WWW) is a major scientific and technological challenge. This has been recently tackled by characterizing the properties of its representative graphs, in which vertices and directed edges are identified with Web pages and hyperlinks, respectively. Data gathered in large-scale crawls have been analyzed by several groups resulting in a general picture of the WWW that encompasses many of the complex properties typical of rapidly evolving networks. In this article, we report a detailed statistical analysis of the topological properties of four different WWW graphs obtained with different crawlers. We find that, despite the very large size of the samples, the statistical measures characterizing these graphs differ quantitatively, and in some cases qualitatively, depending on the domain analyzed and the crawl used for gathering the data. This spurs the issue of the presence of sampling biases and structural differences of Web crawls that might induce properties not representative of the actual global underlying graph. In short, the stability of the widely accepted statistical description of the Web is called into question. In order to provide a more accurate characterization of the Web graph, we study statistical measures beyond the degree distribution, such as degree-degree correlation functions or the statistics of reciprocal connections. The latter appears to enclose the relevant correlations of the WWW graph and carry most of the topologica.