Thesis
Large Scale Topic Modeling Using Search Queries: An Information-Theoretic Approach
Author
Ramírez Rangel, Eduardo H.
Institution
Abstract
Creating topic models of text collections is an important step towards more adaptive information access and retrieval applications. Such models encode knowledge of the topics discussed in a collection, the documents that belong to each topic, and the semantic similarity of any given pair of topics. Among other things, they can be used to focus or disambiguate search queries and to construct visualizations for navigating the collection. So far, the dominant paradigm in topic modeling has been the Probabilistic Topic Modeling approach, in which topics are represented as probability distributions over terms and documents are assumed to be generated from a mixture of random topics. Although such models are theoretically sound, their high computational complexity makes them difficult to use on very large-scale collections. In this work we propose an alternative topic modeling paradigm based on a simpler representation of topics as freely overlapping clusters of semantically similar documents, which can take advantage of highly scalable clustering algorithms. We then propose the Query-based Topic Modeling framework (QTM), an information-theoretic method that assumes the existence of a "golden" set of queries that can capture most of the semantic information of the collection and produce models with maximum semantic coherence. The QTM method uses information-theoretic heuristics to find a set of "topical-queries", which are then co-clustered along with the documents of the collection and transformed to produce overlapping document clusters. The QTM framework was designed with scalability in mind and can be executed in parallel on commodity-class machines using the Map-Reduce approach. Finally, in order to compare the QTM results with models generated by other methods, we have developed metrics that formalize the notion of semantic coherence using probabilistic concepts and the familiar notions of recall and precision.
In contrast to traditional clustering metrics, the proposed metrics have been generalized to validate overlapping and potentially incomplete clustering solutions using multi-labeled corpora. We use them to experimentally validate our query-based approach, showing that models produced using selected queries outperform those produced using the collection vocabulary. We also explore the heuristics and settings that determine the performance of QTM and show that the proposed method can produce models of quality comparable, or even superior, to that of models produced with state-of-the-art probabilistic methods.