Conference proceedings
’HALITE IND. DS’: fast and scalable subspace clustering for multidimensional data streams
Date
2016-05
Registered in:
SIAM International Conference on Data Mining, XVI, 2016, Miami.
9781611974348
2167-0102
Author
Silva, Afonso E. da
Sanches, Lucas L.
Fraideinberze, Antonio C.
Cordeiro, Robson Leonardo Ferreira
Institution
Abstract
Given a data stream with many attributes and a high frequency of events, how can similar events be clustered? Can it be done in real time? For example, how can decades of frequent measurements of tens of climatic attributes be clustered to aid real-time alert systems in forecasting extreme climatic events, such as floods and hurricanes? The task of clustering data with many attributes is known as subspace clustering. Today, there is a need for algorithms of this type that are well suited to processing multidimensional data streams, for which real-time processing is highly desirable. This paper proposes the new algorithm 'HALITE IND. DS', a fast, scalable and highly accurate subspace clustering algorithm for multidimensional data streams. It improves upon an existing technique that was originally designed to process static (non-stream) data. Our main contributions are: (1) Analysis of Data Streams: the new algorithm takes advantage of the knowledge obtained from clustering past data to ease the clustering of present data. This allows our 'HALITE IND. DS' to be considerably faster than its base algorithm while obtaining the same accuracy of results; (2) Real-Time Processing: as opposed to the state of the art, 'HALITE IND. DS' is fast and scalable, making it feasible to analyze streams with many attributes and a high frequency of events in real time; (3) Experiments: we ran experiments using synthetic data and a real multidimensional stream with almost one century of climatic data. Our 'HALITE IND. DS' was up to 217 times faster than 5 representative works, i.e., its base algorithm plus 4 others from the state of the art, always presenting highly accurate results.