Thesis
Distributed d-fuzzstream: agrupamento fuzzy não supervisionado distribuído em fluxo de dados
Date
2019-02-26
Author
Schick, Leonardo
Abstract
Over the last decade, interest has grown in managing unbounded sequences of data produced at high rates, known as data streams (DSs). Applications of DSs include sensor networks, weather analysis, stock market analysis, and network traffic monitoring, among others. As in other data domains, there is interest in extracting useful knowledge from DSs using automatic techniques. However, extracting potentially relevant information in this domain requires techniques that can overcome storage constraints and conduct a continuous learning process, adapting to changes in the data distribution. One of the great challenges in DS learning is keeping up with the data production rate, avoiding the accumulation of unprocessed examples and ensuring real-time responses. If new examples are produced faster than they can be processed, the data accumulate continuously and are often discarded, so the learned concepts become outdated. Recently, with the popularization of distributed computing tools, many machine learning algorithms have been adapted to run in a distributed manner. Most of these, however, are adaptations of classical machine learning algorithms designed to handle large volumes of data. There is therefore a lack of algorithms for learning from DSs that can be executed in a distributed manner, particularly in the context of unsupervised learning. This work presents a new distributed approach to fuzzy clustering in data streams using the Online-Offline framework. The approach includes two distinct processing strategies, one for each phase of the framework. In the online phase, the strategy summarizes the data in distinct summary structures, which are then unified by an intermediate phase. In the offline phase, the distributed processing strategy clusters the unified summary structure using several algorithms and different numbers of groups.
A total of 135 experiments were conducted on six data sets with a large number of examples, four synthetic and two real, from which internal, external, and performance metrics were collected. The results showed that the distributed approach produced clustering results similar to those of the non-distributed version while processing, on average, 18,200 examples per second. This indicates that the proposal is relevant, since it generated similar partitions using only 6% of the processing time of the non-distributed version.
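The three-stage flow described in the abstract (per-partition summarization, an intermediate unification step, and offline clustering with several numbers of groups) can be illustrated with a minimal sketch. This is not the thesis implementation: the `Summary` class, the `radius` threshold, and all function names are hypothetical; crisp nearest-summary assignment stands in for the fuzzy summary updates of the actual method; a weighted fuzzy c-means is assumed as the offline algorithm; and a local thread pool stands in for a distributed engine.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Summary:
    """Hypothetical summary structure: total weight plus linear sum of points."""

    def __init__(self, linear_sum, weight=1.0):
        self.linear_sum = list(linear_sum)
        self.weight = weight

    @property
    def center(self):
        return [s / self.weight for s in self.linear_sum]

    def absorb(self, linear_sum, weight=1.0):
        self.linear_sum = [s + o for s, o in zip(self.linear_sum, linear_sum)]
        self.weight += weight


def add_to_summaries(summaries, linear_sum, weight, radius):
    """Absorb into the nearest summary within `radius`, else open a new one."""
    center = [s / weight for s in linear_sum]
    nearest = min(summaries, key=lambda s: dist(s.center, center), default=None)
    if nearest is not None and dist(nearest.center, center) <= radius:
        nearest.absorb(linear_sum, weight)
    else:
        summaries.append(Summary(linear_sum, weight))


def online_phase(chunk, radius):
    """Online phase (assumed crisp variant): one pass over one stream partition."""
    summaries = []
    for point in chunk:
        add_to_summaries(summaries, point, 1.0, radius)
    return summaries


def merge_phase(per_worker, radius):
    """Intermediate phase: unify the workers' summaries into one structure."""
    unified = []
    for summaries in per_worker:
        for s in summaries:
            add_to_summaries(unified, s.linear_sum, s.weight, radius)
    return unified


def weighted_fcm(points, weights, c, m=2.0, iters=50, seed=0):
    """Weighted fuzzy c-means over summary centers; returns the c cluster centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, c)]
    exponent = 2.0 / (m - 1.0)
    for _ in range(iters):
        # Membership of each point in each cluster (distances floored to avoid 0).
        u = []
        for p in points:
            d = [max(dist(p, ctr), 1e-12) for ctr in centers]
            u.append([1.0 / sum((d[j] / d[k]) ** exponent for k in range(c))
                      for j in range(c)])
        # Center update, scaled by the weight each summary carries.
        for j in range(c):
            coef = [w * (u[i][j] ** m) for i, w in enumerate(weights)]
            total = sum(coef)
            centers[j] = [sum(cf * p[dim] for cf, p in zip(coef, points)) / total
                          for dim in range(len(points[0]))]
    return centers


def offline_phase(unified, cluster_counts):
    """Offline phase: cluster the unified summary for several numbers of groups."""
    points = [s.center for s in unified]
    weights = [s.weight for s in unified]
    with ThreadPoolExecutor() as pool:
        futures = {c: pool.submit(weighted_fcm, points, weights, c)
                   for c in cluster_counts if c <= len(points)}
        return {c: f.result() for c, f in futures.items()}


# Synthetic stream: three well-separated 2-D blobs, split across four "workers".
rng = random.Random(42)
blobs = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
stream = [[rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)]
          for _ in range(300) for cx, cy in blobs]
chunks = [stream[i::4] for i in range(4)]
with ThreadPoolExecutor() as pool:
    per_worker = list(pool.map(lambda ch: online_phase(ch, radius=1.0), chunks))
unified = merge_phase(per_worker, radius=1.0)
partitions = offline_phase(unified, cluster_counts=[2, 3, 4])
```

Here `partitions` maps each requested number of groups to the cluster centers found for it; any count larger than the number of unified summaries is skipped. Because the offline phase works only on the unified summaries, its cost depends on the number of summaries and not on the number of raw examples.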