Compressed suffix trees for repetitive collections based on block trees

Cáceres Reyes, Manuel Ariel

Tesis

Fecha

2019

Registro en:

https://repositorio.uchile.cl/handle/2250/172755

Autor

Cáceres Reyes, Manuel Ariel

Institución

Universidad de Chile

Resumen

The Block Tree is a recently proposed data structure representing a sequence T of length n in space bounded by the number of phrases z of the Lempel-Ziv parsing of T. It uses O(z log(n/z)) space and supports access to symbols and other operations in logarithmic time. We contribute both to the theoretical and practical development of Block Trees, proving new properties, presenting the first implementation faithful to its theoretical description, making further improvements to the structure, and studying a wide set of variants. Suffix trees are a fundamental data structure in stringology with a myriad of applications in areas like bioinformatics, enabling efficient solutions to complex problems such as (approximate) pattern matching, finding repeated substrings, computing matching statistics, computing maximal matches, and many others. However, their space usage, though linear, is an important problem in applications. A line of research addressing this problem consists on building compact representations of suffix trees, named Compressed Suffix Trees (CSTs), which simulate all the suffix tree functionality within space bounded by the information content of the sequence. Current implementations use 10 bits per symbol and support the operations in a few microseconds; or use 5 bits per symbol but their operation times raise to milliseconds. A recent branch of this area is repetition-aware CSTs, which focuses on building suffix trees for repetitive inputs, such as a versioned document or a collection of DNA sequences of a group of humans. CSTs in this area adapt Lempel-Ziv, Grammar and run-length based indexes. The most successful CST in practice is the grammar-compressed suffix tree (GCST), which uses about 2 bits per symbol and supports the operations in tens to hundreds of microseconds. We apply our Block Trees to design a new repetition-aware CST that, though larger than the GCST (i.e., using 3--4 bits per symbol), is faster by orders of magnitude. We obtain this result by using Block Trees on various components of the CST. First, we build them on the parentheses representation of the tree topology; these Block Trees must be augmented to support tree navigation operations on the parentheses. Second, we use Block Trees on various differentially-encoded arrays that compose a CST: the suffix array, its inverse, and the longest-common-prefix array. We end up with a set of recommended combinations for practitioners. All our code is publicly available at https://github.com/elarielcl/BT-CST.

Materias

Compresión de datos (Ciencia de la computación)

Bioinformática

Suffix trees

Mostrar el registro completo del ítem