Master's Dissertation
Subject classification through context-enriched language models
Date
2015-02-23
Author
Alexandre Guelman Davis
Institution
Abstract
Over the years, humans have developed an intricate system of communication, with means of conveying information that range from books, newspapers and television to, more recently, social media. However, efficiently retrieving and understanding social media messages to extract useful information is challenging, especially because short messages depend strongly on context. Users often assume that their social media audience is aware of the associated background and the underlying real-world events, which allows them to shorten their messages without compromising the effectiveness of communication. Traditional data mining algorithms do not account for contextual information. We argue that exploiting context could lead to more complete and accurate analyses of social media messages. In this work, therefore, we demonstrate how relevant contextual information is to successfully filtering messages related to a selected subject. We also show that recall increases when context is taken into account. Furthermore, we propose methods for filtering relevant messages without relying only on keywords, provided that the context is known and can be detected. In this dissertation, we propose a novel approach to subject classification of social media messages that considers both textual and extra-textual (or contextual) information. This approach uses a proposed context-enriched language model and employs techniques based on concepts from computational linguistics, more specifically from the field of Pragmatics. To analyze the impact of the proposed approach experimentally, we used datasets containing messages about three major American sports (football, baseball and basketball). Results indicate up to a 50% improvement in retrieval over text-based approaches due to the use of contextual information.