Thesis
An Integrated Data Model and Web Protocol for Arbitrarily Structured Information
Author
Álvarez Cavazos, Francisco
Institution
Abstract
Within the Web's data ecosystem dwell applications that consume and produce information with varying degrees of structure, ranging from highly structured business data to the semistructured or unstructured data found in documents that contain a significant amount of text. Current database technology was not designed for the Web and, consequently, database communication protocols, query models, and even data models are inadequate for the demands of "data everywhere." Thus, a technique to uniformly store, search, transport and update the full variety of information within Web or intranet environments has yet to be designed. The Web context requires the data management community to address: (a) data modeling and basic querying to support multiple data models to accommodate many types of data sources, (b) powerful search mechanisms that accept keyword queries and select relevant structured sources that may answer them, and (c) the ability to combine answers from structured and unstructured data in a principled way. Consequently, this dissertation constructively designs a technique to store, search, transport and update unstructured and structured information for Web or intranet-based environments: the Relational-text (RELTEX) protocol. Central to the design of the protocol is an integrated model for structured and unstructured data and its associated declarative language interface, namely, the RELTEX model and calculus. The RELTEX model is constructively defined starting from the relational and information retrieval models and their associated retrieval strategies. The model's data items are tuples with structured "columns" and unstructured "fields" that further allow idiosyncratic schema in the form of "extension fields", which are tuple-specific name/value pairs.
This flexibility allows representation of totally unstructured information, totally structured information, and mixtures of structured and unstructured data, such as tables where tuples have a varying number of fields over time. RELTEX calculus extends tuple relational calculus to consider text fields, similarity matches, match ranking, and sort order. Then, building on the formally defined RELTEX data model and calculus and starting from the architecture of the Web, the RELTEX protocol is defined as a resource-centric protocol to describe and manipulate the data and schema of unstructured and structured data sources. An equivalence mapping between RELTEX and the relational and information retrieval models is provided. The mapping suggests a wide range of applicability for RELTEX, thus proving the model's value. In turn, the RELTEX protocol is distinguished from other techniques for data access and storage on the Web in that (a) it supports structured and unstructured data manipulation and retrieval, (b) it offers operations to describe and manipulate both the common and the idiosyncratic schema of data items, and (c) it directly federates data items to the Web over a compound key, thus demonstrating novelty and value. The RELTEX protocol, model and calculus are proven feasible by means of a proof-of-concept implementation. Starting from a motivating scenario, the prototype is used to provide representative examples of data and schema operations. Having demonstrated that the RELTEX protocol and model contribute towards the data modeling and basic querying challenge posed by the Web, we expect this dissertation to benefit researchers and practitioners alike with a novel, valuable, effective and feasible technique to store, search, transport and update unstructured and structured information in the Web environment.
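As an informal illustration of the data item described in the abstract, the following Python sketch models a tuple with structured columns, unstructured text fields, and tuple-specific extension fields, and runs a query that combines an exact column filter with a ranked keyword match. It is not the dissertation's implementation: the names `Item`, `score` and `search` are hypothetical, and the term-count scoring is only a stand-in for the similarity matching and match ranking that the RELTEX calculus defines.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Item:
    """Hypothetical RELTEX-style data item (all names are illustrative)."""
    columns: Dict[str, Any]   # structured values, common to the schema
    fields: Dict[str, str]    # unstructured text, common to the schema
    # Tuple-specific name/value pairs (the "extension fields").
    extensions: Dict[str, str] = field(default_factory=dict)

    def text(self) -> str:
        """Concatenate all unstructured text carried by this item."""
        parts = list(self.fields.values()) + list(self.extensions.values())
        return " ".join(parts)


def score(item: Item, keywords: List[str]) -> int:
    """Assumed stand-in ranking: count keyword occurrences in the text."""
    words = item.text().lower().split()
    return sum(words.count(k.lower()) for k in keywords)


def search(items: List[Item], keywords: List[str],
           **column_filter: Any) -> List[Tuple[int, Item]]:
    """Filter exactly on columns, then rank the survivors by text score."""
    hits = []
    for item in items:
        if all(item.columns.get(c) == v for c, v in column_filter.items()):
            s = score(item, keywords)
            if s > 0:
                hits.append((s, item))
    # Highest score first; ties keep their input order (stable sort).
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits


items = [
    Item({"id": 1, "year": 2008}, {"title": "web data model"},
         {"note": "web protocol"}),
    Item({"id": 2, "year": 2008}, {"title": "relational data"}),
    Item({"id": 3, "year": 2007}, {"title": "web web web"}),
]
results = search(items, ["web"], year=2008)
```

Here only the first item passes both the `year` filter and the keyword match, and its extension field `note` contributes to its score even though the other items have no such field, which is the kind of per-tuple schema variation the abstract describes.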