Fast unsupervised online drift detection using incremental Kolmogorov-Smirnov test

Reis, Denis dos; Flach, Peter; Matwin, Stan; Batista, Gustavo Enrique de Almeida Prado Alves

Actas de congresos

Fecha

2016-08

Registro en:

ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, XXII, 2016, San Francisco.

9781450342322

http://www.producao.usp.br/handle/BDPI/51024

http://dx.doi.org/10.1145/2939672.2939836

http://repositorioslatinoamericanos.uchile.cl/handle/2250/1646097

Autor

Reis, Denis dos

Flach, Peter

Matwin, Stan

Batista, Gustavo Enrique de Almeida Prado Alves

Institución

Universidade de São Paulo (Brasil)

Resumen

Data stream research has grown rapidly over the last decade. Two major features distinguish data stream from batch learning: stream data are generated on the y, possibly in a fast and variable rate; and the underlying data distribution can be non-stationary, leading to a phenomenon known as concept drift. Therefore, most of the research on data stream classification focuses on proposing efficient models that can adapt to concept drifts and maintain a stable performance over time. However, specifically for the classification task, the majority of such methods rely on the instantaneous availability of true labels for all already classified instances. This is a strong assumption that is rarely fulfilled in practical applications. Hence there is a clear need for efficient methods that can detect concept drifts in an unsupervised way. One possibility is the well-known Kolmogorov-Smirnov test, a statistical hypothesis test that checks whether two samples differ. This work has two main contributions. The first one is the Incremental Kolmogorov-Smirnov algorithm that allows performing the Kolmogorov-Smirnov hypothesis test instantly using two samples that change over time, where the change is an insertion and/or removal of an observation. Our algorithm employs a randomized tree and is able to perform the insertion and removal operations in O(logN) with high probability and calculate the Kolmogorov-Smirnov test in O(1), where N is the number of sample observations. This is a significant speed-up compared to the O(N logN) cost of the non-incremental implementation. The second contribution is the use of the Incremental Kolmogorov-Smirnov test to detect concept drifts without true labels. Classification algorithms adapted to use the test rely on a limited portion of those labels just to update the classification model after a concept drift is detected.

Materias

Treap

Mostrar el registro completo del ítem