ExtraWeb: um sumarizador de documentos Web baseado em etiquetas HTML e ontologia
2006-07-10
Registered in:
SILVA, Patrick Pedreira. ExtraWeb: um sumarizador de documentos Web baseado em etiquetas HTML e ontologia. 2006. 168 f. Dissertação (Mestrado em Ciências Exatas e da Terra) - Universidade Federal de São Carlos, São Carlos, 2006.
Silva, Patrick Pedreira
This dissertation presents an automatic summarizer of Web documents based on
both HTML tags and ontological knowledge. It has been derived from two independent
approaches: one that focuses solely upon HTML tags, and another that focuses only on
ontological knowledge. The three approaches (tags only, ontology only, and the two
combined) were implemented and assessed, and the results indicate that combining both
knowledge types has promising descriptive power for Web documents. The resulting
prototype has been named ExtraWeb.
The ExtraWeb system exploits the HTML structure of Web documents in
Portuguese, together with semantic information drawn from the Yahoo ontology in
Portuguese. The ontology has been enriched with additional terms extracted from the
Diadorim thesaurus and from Wikipedia. In a simulated Web search, ExtraWeb achieved a
degree of utility similar to Google's, showing its potential to signal, through extracts,
the relevance of the retrieved documents. Signaling relevance has recently become an
important issue. Extracts may be
particularly useful as surrogates for the descriptions currently provided by existing
search engines. They may even replace the corresponding source documents. In the
former case, those descriptions do not necessarily convey the relevant content of the
documents; in the latter, reading full documents imposes substantial overhead on Web
users. In both cases, extracts may improve the search task, provided that they actually
signal relevant content. ExtraWeb is therefore a potential plug-in for search engines, to
improve their descriptions. However, its scalability and its integration into a real
setting have not yet been explored.
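
The abstract does not describe ExtraWeb's scoring procedure, so the following is only a minimal sketch of how HTML tag cues and ontology terms might be combined to pick extract sentences. It is not the dissertation's algorithm. The tag weights, the additive score, and the names `TAG_WEIGHTS`, `WeightedTextParser`, `score_sentences`, and `extract` are all assumptions made for illustration, and the ontology is reduced to a plain set of lowercase terms.

```python
import re
from html.parser import HTMLParser

# Hypothetical weights: text inside these tags is assumed to be more salient.
TAG_WEIGHTS = {"title": 3.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5}


class WeightedTextParser(HTMLParser):
    """Collects (text, weight) pairs; the weight comes from the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.chunks = []  # (text, weight) in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching tag (tolerates unclosed tags).
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.stack), default=1.0)
            self.chunks.append((text, weight))


def score_sentences(html, ontology_terms):
    """Returns (score, sentence) pairs; score = tag weight + ontology term hits."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    scored = []
    for text, weight in parser.chunks:
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            words = set(re.findall(r"\w+", sentence.lower()))
            hits = len(words & ontology_terms)
            scored.append((weight + hits, sentence))
    return scored


def extract(html, ontology_terms, n=2):
    """Returns the n best-scoring sentences, kept in document order."""
    scored = score_sentences(html, ontology_terms)
    best = sorted(range(len(scored)), key=lambda i: -scored[i][0])[:n]
    return [scored[i][1] for i in sorted(best)]
```

Keeping the selected sentences in document order is a common choice for extracts, since it preserves whatever coherence the source text had. A real system would also need sentence segmentation and term matching suited to Portuguese, which this sketch does not attempt.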