Tese de Doutorado
Projeto e análise de sistemas de busca na web
Claudine Santos Badue
Web search engines are expensive to maintain, expensive to operate, and hard to design. Modern search engines rely on clusters of server machines for query processing. Thus, the performance of parallel query processing in a cluster of index servers is crucial for modern Web search engines. The objective of this thesis is to provide a performance framework for the design and analysis of the infrastructure of Web search engines. In this framework we (i) investigate and analyze the imbalance issue in a computational cluster composed of homogeneous index servers and (ii) propose a capacity planning model for Web search engines.In a cluster of index servers, the response time basically depends on the service time of the slowest server to generate a partial ranked answer. Previous approaches investigate performance issues in this context using simulation, analytical modeling, experimentation, or a combination of them. Nevertheless, these approaches simply assume balanced service times among homogeneous index servers, a scenario that we did not observe in our experimentation. On the contrary, we found that even with a balanced distribution of the document collection among index servers, relations between the frequency of a query in the collection and the size of its corresponding inverted lists lead to imbalances in query service times at these same servers, because these relations affect disk cache behavior. Further, the relative sizes of the main memory at each index server (with regard to disk space usage) and the number of servers participating in the parallel query processing also affect imbalance of local query service times.Predicting the performance of a Web search engine is usually done empirically through experimentation, requiring a costly setup. Thus, modeling is of natural appeal in this context. We introduce a capacity planning model for Web search engines that considers the imbalance in query service times among homogeneous index servers. Our model, which is based on a queueing network, is simple and yet reasonably accurate. We discuss how we tune it up and how we apply it to predict, for instance, the impact on the query response time when parameters such as CPUs and disks are changed. This allows the manager of the search engine to determine a priori whether a new configuration of the system will keep the query response under specified constraints. Our approach is distinct and, we believe, useful to predict the performance of real Web search engines.