Búsqueda avanzada

El autor Francisco Claude ha publicado 2 artículo(s):

1 - Practical Representations for Web and Social Graphs

Graphs are a natural way of modeling connections among Web pages in a network or people according to a criteria like friendship, co-authorship, etc. Many algorithms that compute and infer interesting facts out of these networks work directly over these graphs. Some examples of this are Connected components, HITS, PageRank, spamdetection, among others. In social networks, graph mining also enables the study of populations behavior. Successful graph mining not only enables segmentation of users but also prediction of behavior. Link analysis and graph mining remains an area of high interest and development.
These human-generated graphs are growing at an amazing pace, and their representation in main memory, secondary memory, and distributed systems are getting more and more attention. Furthermore, space-efficient representations for these graphs have succeeded at exploiting regularities that are particular to the domain. In the case of Web graphs the main properties exploited are the locality of reference, skewed in/out-degree, and similarity of adjacency lists among nodes of the same domain. In social networks, there is a tendency towards forming cliques in the network.

Autores: Francisco Claude / usana Ladra / 
Palabras Clave:

2 - Universal indexes for highly repetitive document collections

Abstract ======== Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower. Publication Details =================== Francisco Claude, Antonio Fariña, Miguel A. Martínez-Prieto, Gonzalo Navarro. Universal indexes for highly repetitive document collections Information Systems, 61, pp. 1-23, 2016, DOI: Citations Google Scholar: 3 (2 self-citations)

Autores: Francisco Claude / Antonio Fariña / Miguel A. Martinez-Prieto / Gonzalo Navarro / 
Palabras Clave: Inverted index - Repetitive collections - Self-index