Classificação de sites a partir das análises estrutural e textual

Ribas, Oeslei Taborda

Brasil | masterThesis

dc.contributor	Kaestner, Celso Antônio Alves
dc.creator	Ribas, Oeslei Taborda
dc.date.accessioned	2013-10-16T17:43:26Z
dc.date.accessioned	2022-12-06T14:50:03Z
dc.date.available	2013-10-16T17:43:26Z
dc.date.available	2022-12-06T14:50:03Z
dc.date.created	2013-10-16T17:43:26Z
dc.date.issued	2013-08-28
dc.identifier	RIBAS, Oeslei Taborda. Classificação de sites a partir das análises estrutural e textual. 2013. 125 f. Dissertação (Mestrado em Computação Aplicada) - Universidade Tecnológica Federal do Paraná, Curitiba, 2013.
dc.identifier	http://repositorio.utfpr.edu.br/jspui/handle/1/616
dc.identifier.uri	https://repositorioslatinoamericanos.uchile.cl/handle/2250/5256627
dc.description.abstract	With the wide use of the web and its constant growth, the task of automatically classifying websites has gained increasing importance. On many occasions it is necessary to block access to specific sites, as in the case of adult content sites in elementary and secondary schools. Different studies have appeared in the literature proposing new methods for site classification, with the goal of increasing the rate of correctly categorized pages. This work aims to contribute to current classification methods by comparing four aspects of the classification process: classification algorithms, dimensionality (number of selected attributes), attribute evaluation metrics, and the selection of textual and structural attributes present in web pages. We use the vector model to represent text and a classical machine learning approach for the classification task. Several metrics are used to select the most relevant terms, and classification algorithms from different paradigms are compared: probabilistic (Naïve Bayes), decision tree (C4.5), instance-based learning (KNN, K-Nearest Neighbor) and support vector machine (SVM). The experiments were performed on a dataset containing two languages, English and Portuguese. The results show that it is possible to obtain a classifier with good success rates using only the anchor text of hyperlinks; in the experiments, the classifier based on this information achieved an F-measure of 99.59%.
dc.publisher	Universidade Tecnológica Federal do Paraná
dc.publisher	Curitiba
dc.publisher	Programa de Pós-Graduação em Computação Aplicada
dc.subject	Sites da web - Avaliação e classificação
dc.subject	Processamento de textos (Computação)
dc.subject	Aprendizado do computador
dc.subject	Redes neurais (Computação)
dc.subject	HTML (Linguagem de marcação de documento)
dc.subject	Métodos de simulação
dc.subject	Web sites - Ratings and rankings
dc.subject	Text processing (Computer science)
dc.subject	Machine learning
dc.subject	Neural networks (Computer science)
dc.subject	HTML (Document markup language)
dc.subject	Simulation methods
dc.title	Classificação de sites a partir das análises estrutural e textual
dc.type	masterThesis

This item belongs to the following institution

Universidade Tecnológica Federal do Paraná (Brasil)