Búsqueda avanzada

El autor Gonzalo Navarro ha publicado 3 artículo(s):

1 - Universal indexes for highly repetitive document collections

Abstract ======== Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower. Publication Details =================== Francisco Claude, Antonio Fariña, Miguel A. Martínez-Prieto, Gonzalo Navarro. Universal indexes for highly repetitive document collections Information Systems, 61, pp. 1-23, 2016, DOI: Citations Google Scholar: 3 (2 self-citations)

Autores: Francisco Claude / Antonio Fariña / Miguel A. Martinez-Prieto / Gonzalo Navarro / 
Palabras Clave: Inverted index - Repetitive collections - Self-index

2 - Compact Representations of Event Sequences

Brisaboa, N. R.; de Bernardo, Guillermo; Navarro, G.; Varela Rodeiro, T.; Seco, D.: «Compact Representations of Event Sequences», en Proceedings of the 2018 Data Compression Conference (DCC 2018), IEEE Computer Society, Snowbird, Utah (Estados Unidos), 2018, pp. 237-246.GGS Class: 2DOI: 10.1109/DCC.2018.00032We introduce a new technique for the efficient management of large sequences of multidimensional data, which takes advantage of regularities that arise in real-world datasets and supports different types of aggregation queries. More importantly, our representation is flexible in the sense that the relevant dimensions and queries may be used to guide the construction process, easily providing a space-time tradeoff depending on the relevant queries in the domain. We provide two alternative representations for sequences of multidimensional data and describe the techniques to efficiently store the datasets and to perform aggregation queries over the compressed representation. We perform experimental evaluation on realistic datasets, showing the space efficiency and query capabilities of our proposal.

Autores: Tirso Varela Rodeiro / Nieves Rodríguez Brisaboa / Gonzalo Navarro / Diego Seco / Guillermo de Bernardo Roca / 
Palabras Clave: Aggregation - Multidimensional data - Succinct data structures

3 - Two-Dimensional Block Trees

Brisaboa, N. R.; Gagie, T.; Gomez Brandon, A.; Navarro, G.: «Two-Dimensional Block Trees», en Proceedings of the 2018 Data Compression Conference (DCC 2018), IEEE Computer Society, Snowbird, Utah (Estados Unidos), 2018, pp. 227-236.GGS Class: 2DOI: 10.1109/DCC.2018.00031The Block Tree (BT) is a novel compact data structure designed to compress sequence collections. It obtains compression ratios close to Lempel-Ziv and supports efficient direct access to any substring. The BT divides the text recursively into fixed-size blocks and those appearing earlier are represented with pointers. On repetitive collections, a few blocks can represent all the others, and thus the BT reduces the size by orders of magnitude. In this paper we extend the BT to two dimensions, to exploit repetitiveness in collections of images, graphs, and maps. This two-dimensional Block Tree divides the image regularly into subimages and replaces some of them by pointers to other occurrences thereof. We develop a specific variant aimed at compressing the adjacency matrices of Web graphs, obtaining space reductions of up to 50% compared with the k2-tree, which is the best alternative supporting direct and reverse navigation in the graph.

Autores: Nieves R. Brisaboa / Travis Gagie / Adrián Gómez Brandón  / Gonzalo Navarro / 
Palabras Clave: Compact data structures - Compression - Images - Web graphs