'HALITE IND. DS': fast and scalable subspace clustering for multidimensional data streams

Silva, Afonso E. da; Sanches, Lucas L.; Fraideinberze, Antonio C.; Cordeiro, Robson Leonardo Ferreira

Conference proceedings

Date

2016-05

Record in:

SIAM International Conference on Data Mining, XVI, 2016, Miami.

9781611974348

2167-0102

http://www.producao.usp.br/handle/BDPI/51004

http://dx.doi.org/10.1137/1.9781611974348.40

http://repositorioslatinoamericanos.uchile.cl/handle/2250/1646031

Author

Silva, Afonso E. da

Sanches, Lucas L.

Fraideinberze, Antonio C.

Cordeiro, Robson Leonardo Ferreira

Institution

Universidade de São Paulo (Brazil)

Abstract

Given a data stream with many attributes and a high frequency of events, how can similar events be clustered? Can it be done in real time? For example, how can decades of frequent measurements of tens of climatic attributes be clustered to aid real-time alert systems in forecasting extreme climatic events, such as floods and hurricanes? The task of clustering data with many attributes is known as subspace clustering. Today, there is a need for algorithms of this type that are well suited to processing multidimensional data streams, for which real-time processing is highly desirable. This paper proposes the new algorithm 'HALITE IND. DS', a fast, scalable and highly accurate subspace clustering algorithm for multidimensional data streams. It improves upon an existing technique that was originally designed to process static (non-stream) data. Our main contributions are: (1) Analysis of Data Streams: the new algorithm takes advantage of the knowledge obtained from clustering past data to ease the clustering of present data. This allows 'HALITE IND. DS' to be considerably faster than its base algorithm while obtaining the same accuracy of results; (2) Real-Time Processing: as opposed to the state of the art, 'HALITE IND. DS' is fast and scalable, making it feasible to analyze streams with many attributes and a high frequency of events in real time; (3) Experiments: we ran experiments using synthetic data and a real multidimensional stream with almost one century of climatic data. 'HALITE IND. DS' was up to 217 times faster than 5 representative works, i.e., its base algorithm plus 4 others from the state of the art, while always presenting highly accurate results.

Subjects

subspace clustering

moderate-to-high-dimensional data streams

real-time processing

climatic streams