Actas de congresos
Fast unsupervised online drift detection using incremental Kolmogorov-Smirnov test
Fecha
2016-08Registro en:
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, XXII, 2016, San Francisco.
9781450342322
Autor
Reis, Denis dos
Flach, Peter
Matwin, Stan
Batista, Gustavo Enrique de Almeida Prado Alves
Institución
Resumen
Data stream research has grown rapidly over the last decade. Two major features distinguish data stream from batch learning: stream data are generated on the
y, possibly in a fast and variable rate; and the underlying data distribution can be non-stationary, leading to a phenomenon known as concept drift. Therefore, most of the research on data stream classification focuses on proposing efficient models that can adapt to concept drifts and maintain a stable performance over time. However, specifically for the classification task, the majority of such methods rely on the instantaneous availability of true labels for all already classified instances. This is a strong assumption that is rarely fulfilled in practical applications. Hence there is a clear need for efficient methods that can detect concept drifts in an unsupervised way. One possibility is the well-known Kolmogorov-Smirnov test, a statistical hypothesis test that checks whether two samples differ. This work has two main contributions. The first one is the Incremental Kolmogorov-Smirnov algorithm that allows performing the Kolmogorov-Smirnov hypothesis test instantly using two samples that change over time, where the change is an insertion and/or removal of an observation. Our algorithm employs a randomized tree and is able to perform the insertion and removal operations in O(logN) with high probability and calculate the Kolmogorov-Smirnov test in O(1), where N is the number of sample observations. This is a significant speed-up compared to the O(N logN) cost of the non-incremental implementation. The second contribution is the use of the Incremental Kolmogorov-Smirnov test to detect concept drifts without true labels. Classification algorithms adapted to use the test rely on a limited portion of those labels just to update the classification model after a concept drift is detected.