doctoralThesis
Uma arquitetura para análise de agrupamentos sobre bases de dados distribuídas (An architecture for cluster analysis over distributed databases)
Date
2009-03-06
Registered in:
GORGÔNIO, Flavius da Luz e. Uma arquitetura para análise de agrupamentos sobre bases de dados distribuídas. 2009. 156f. Tese (Doutorado em Engenharia Elétrica e Computação) - Centro de Tecnologia, Universidade Federal do Rio Grande do Norte, Natal, 2009.
Author
Gorgônio, Flavius da Luz e
Abstract
Data mining can be defined as a set of techniques for extracting knowledge and searching
for useful and previously unknown patterns in large multidimensional databases.
Clustering is the process of discovering data clusters within high-dimensional databases,
based on similarities, with minimal knowledge of their structure. Distributed data
clustering is a recent approach for dealing with distributed databases, since traditional
clustering algorithms require centralizing all the data in a single dataset. Moreover,
current privacy requirements in distributed databases demand algorithms with the ability
to perform clustering securely. Thus, a growing need for methods to mine data
stored in a distributed way has motivated the development of algorithms to analyze each
database separately and to combine the partial results to get a final result. This thesis
presents a framework for cluster analysis in distributed databases using traditional
algorithms, such as K-means and self-organizing maps. This approach significantly reduces
the amount of data transferred between remote units and the central unit. The
framework includes a strategy, based on vector quantization, that extracts a
subset of representatives in order to obtain partial views of the clusters existing in each
horizontal and/or vertical partition of the database. The representatives of each
local unit are then sent to the central unit, which combines the partial results
by applying a clustering algorithm over all representative subsets. The experimental results
with different datasets show that the proposed framework obtains results very close to, and
with effectiveness comparable to, those of conventional data mining techniques, in which all the
databases are transferred to a central unit in the pre-processing stage.
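
The two-stage flow described in the abstract (local quantization, then central clustering over the pooled representatives) can be illustrated with a minimal Python sketch. This is not the thesis implementation: it assumes K-means centroids serve as the vector-quantization codewords, covers only horizontal partitioning, uses a deterministic farthest-first initialization chosen for reproducibility, and runs on a small synthetic dataset. All names (`sq_dist`, `farthest_first`, `kmeans`, `local_representatives`, `central_clustering`, `blob`) are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic initialization (an assumption of this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, iters=100):
    """Plain Lloyd-style K-means; returns a list of k centroids."""
    centroids = farthest_first(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        # An empty group keeps its previous centroid.
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def local_representatives(local_data, n_codewords):
    """Runs at each remote unit: quantize the local partition into a few
    representatives (here, K-means centroids used as codewords)."""
    return kmeans(local_data, n_codewords)

def central_clustering(representative_sets, k):
    """Runs at the central unit: cluster the pooled representatives only."""
    pooled = [r for reps in representative_sets for r in reps]
    return kmeans(pooled, k)

def blob(cx, cy):
    """Synthetic 5x5 grid of points centred on (cx, cy)."""
    return [(cx + 0.5 * i, cy + 0.5 * j)
            for i in range(-2, 3) for j in range(-2, 3)]

# Two well-separated synthetic clusters, split horizontally across two sites.
all_points = blob(0.0, 0.0) + blob(10.0, 10.0)
sites = [all_points[0::2], all_points[1::2]]

# Each site sends 4 representatives instead of its 25 raw records.
representative_sets = [local_representatives(s, 4) for s in sites]
final_centroids = sorted(central_clustering(representative_sets, 2))
print(final_centroids)
```

In this toy setting the central unit receives 8 representatives rather than 50 records, which is the kind of transfer reduction the abstract refers to. How many codewords each site should keep, and how vertical partitions are combined, are questions the abstract leaves to the body of the thesis.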