Distributed d-fuzzstream: agrupamento fuzzy não supervisionado distribuído em fluxo de dados (Distributed d-fuzzstream: distributed unsupervised fuzzy clustering in data streams)

Schick, Leonardo

Thesis

Date

2019-02-26

Record at:

SCHICK, Leonardo. Distributed d-fuzzstream: agrupamento fuzzy não supervisionado distribuído em fluxo de dados. 2019. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de São Carlos, São Carlos, 2019. Disponível em: https://repositorio.ufscar.br/handle/ufscar/11543.

https://repositorio.ufscar.br/handle/ufscar/11543

http://repositorioslatinoamericanos.uchile.cl/handle/2250/4041963

Author

Schick, Leonardo

Institution

Universidade Federal de São Carlos (Brazil)

Abstract

Over the last decade, interest has grown in managing unbounded sequences of data produced at high rates, known as data streams (DSs). Applications of DSs include sensor networks, weather analysis, stock market analysis, and network traffic monitoring, among others. As in other data domains, there is interest in extracting useful knowledge from DSs using automatic techniques. However, extracting potentially relevant information in this domain requires techniques that can overcome storage constraints and conduct a continuous learning process, adapting to changes in the data distribution. One of the great challenges in DS learning is keeping up with the data production rate, avoiding the accumulation of unprocessed examples and ensuring real-time responses. If new examples are produced faster than they can be processed, the data accumulate continuously and are often discarded, which leaves the learned concepts outdated. Recently, with the popularization of distributed computing tools, many machine learning algorithms have been adapted to run in a distributed manner. However, most of these are adaptations of classical machine learning algorithms for handling large volumes of data. Therefore, there is a lack of algorithms for learning from DSs that can be executed in a distributed manner, especially in the context of unsupervised learning. This work presents a new distributed approach to fuzzy clustering in data streams using the online-offline framework. The approach includes two different processing strategies, one for each phase of the framework. In the online phase, the strategy summarizes the data in distinct summary structures, which are then unified by an intermediate phase. In the offline phase, the distributed processing strategy clusters the output of the unified summary structure with several algorithms and different numbers of groups.
A total of 135 experiments were conducted on six data sets with a large number of examples, four synthetic and two real, from which internal, external, and performance metrics were collected. The results showed that the distributed approach produced clustering results similar to those obtained without it, processing 18,200 examples per second on average. This shows that the proposal is relevant, since it generated similar partitions using only 6% of the processing time of the non-distributed version.
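The three stages named in the abstract (per-partition summarization, unification of the summaries, and offline clustering with several numbers of groups) can be illustrated with a minimal Python sketch. This is not the thesis's d-fuzzstream implementation: every function name, the `radius` threshold, and the `[weight, sum_x, sum_y]` summary layout are assumptions made for illustration, the online step uses crisp nearest-summary assignment instead of fuzzy micro-clusters, the offline step uses a generic weighted fuzzy c-means, and the "workers" are simulated by a sequential loop over slices of a synthetic 2-D stream.

```python
import math
import random

def center(s):
    """Centroid of a summary stored as [weight, sum_x, sum_y] (hypothetical layout)."""
    return (s[1] / s[0], s[2] / s[0])

def absorb(summaries, s, radius):
    """Merge s into the closest summary within `radius`, or keep it as a new summary."""
    near = [u for u in summaries if math.dist(center(u), center(s)) <= radius]
    if near:
        u = min(near, key=lambda t: math.dist(center(t), center(s)))
        u[0] += s[0]
        u[1] += s[1]
        u[2] += s[2]
    else:
        summaries.append(list(s))

def summarize(points, radius):
    """Online phase on one worker: fold each arriving point into a summary."""
    summaries = []
    for x, y in points:
        absorb(summaries, [1.0, x, y], radius)
    return summaries

def unify(partials, radius):
    """Intermediate phase: merge the summary structures produced by all workers."""
    unified = []
    for part in partials:
        for s in part:
            absorb(unified, s, radius)
    return unified

def memberships(pts, centers, m):
    """Fuzzy c-means membership of every point in every cluster (rows sum to 1)."""
    rows = []
    for p in pts:
        d = [max(math.dist(p, v), 1e-12) for v in centers]
        rows.append([1.0 / sum((di / dj) ** (2.0 / (m - 1.0)) for dj in d) for di in d])
    return rows

def weighted_fcm(summaries, c, m=2.0, iters=30, seed=0):
    """Offline phase: fuzzy c-means over summary centroids, weighted by summary weight."""
    pts = [center(s) for s in summaries]
    w = [s[0] for s in summaries]
    centers = random.Random(seed).sample(pts, c)
    for _ in range(iters):
        u = memberships(pts, centers, m)
        centers = []
        for i in range(c):
            g = [w[k] * u[k][i] ** m for k in range(len(pts))]
            total = sum(g)
            centers.append((sum(gk * p[0] for gk, p in zip(g, pts)) / total,
                            sum(gk * p[1] for gk, p in zip(g, pts)) / total))
    return centers, memberships(pts, centers, m)

# Synthetic stream: 60 points with deterministic jitter around three centers.
blobs = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
stream = [(blobs[i % 3][0] + ((i * 7) % 11 - 5) * 0.05,
           blobs[i % 3][1] + ((i * 3) % 13 - 6) * 0.05) for i in range(60)]

# Four simulated workers each summarize one slice of the stream.
partials = [summarize(stream[k::4], radius=1.0) for k in range(4)]
unified = unify(partials, radius=1.0)

# Offline phase repeated for different numbers of groups.
results = {c: weighted_fcm(unified, c) for c in (2, 3)}
```

In a real deployment, each call to `summarize` would run on a separate node and the runs of `weighted_fcm` for different numbers of groups could themselves be distributed; the sequential loops here only stand in for that parallelism.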

Subjects

Fluxo de dados

Agrupamento de dados

Teoria de conjuntos Fuzzy

Framework online-offline

Sistemas distribuídos

Computação distribuída

Data stream

Data clustering

Fuzzy sets theory

Online-offline framework

Distributed systems

Distributed computing
