Dissertation
Adaptive scheduling for Apache Hadoop (Escalonamento adaptativo para o Apache Hadoop)
Author
Cassales, Guilherme Weigert
Institution
Abstract
Many alternatives have been employed to process the data generated by current applications in a timely manner. One of them, Apache Hadoop, combines parallel and distributed processing with the MapReduce paradigm to provide an environment capable of processing huge data volumes through a simple programming model. However, Apache Hadoop was designed for dedicated, homogeneous clusters, a limitation that creates challenges for those who wish to use the framework in other circumstances. Acquiring a dedicated cluster is often impracticable due to its cost, and the acquisition of replacement parts can threaten a cluster's homogeneity. In these cases, companies commonly resort to the idle computing resources in their networks; the original Hadoop distribution, however, shows serious performance issues under such conditions. This study therefore aimed to improve Hadoop's capacity to adapt to pervasive and shared environments, where resource availability varies during execution. Context-awareness techniques were used to collect information about the available capacity of each worker node, and distributed communication techniques were used to keep this information up to date in the scheduler. The joint use of both techniques aimed to minimize or eliminate the overload that would otherwise occur on shared nodes, yielding a performance improvement of up to 50% in a shared cluster compared with the original distribution and indicating that a simple solution can positively impact scheduling, broadening the variety of environments where Hadoop can be used.
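The capacity-collection idea described in the abstract can be sketched as a simple heuristic: a worker node estimates how many task slots it can safely advertise from its current CPU load and reports that number to the scheduler. The sketch below is a hypothetical illustration only; the function name `available_slots` and its parameters are assumptions for this example, not the dissertation's actual implementation or Hadoop's API.

```python
import os

def available_slots(total_slots, loadavg, cpus):
    """Hypothetical heuristic: scale the node's configured task slots
    by its idle CPU fraction, derived from the 1-minute load average."""
    idle_fraction = max(0.0, 1.0 - loadavg / cpus)
    # Keep at least one slot so the node remains schedulable.
    return max(1, int(total_slots * idle_fraction))

if __name__ == "__main__":
    # On Unix, read this node's real 1-minute load average and CPU count.
    load = os.getloadavg()[0]
    cpus = os.cpu_count() or 1
    print(available_slots(8, load, cpus))
```

Under this heuristic, a half-loaded 4-CPU node configured with 8 slots would advertise `available_slots(8, 2.0, 4) == 4` slots, while a fully loaded one falls back to a single slot, which is the kind of adjustment that avoids overloading shared nodes.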