The author Josep Silva has published 8 article(s):

1 - A Collection of Website Benchmarks Labelled for Template Detection and Content Extraction

Template detection and content extraction are two of the main areas of information retrieval applied to the Web. Both analyse the structure and content of webpages to extract some part of the document, but their objectives differ. Template detection identifies the template of a webpage (usually by comparing it with other webpages of the same website), whereas content extraction identifies the main content of the webpage and discards the rest. The two are therefore complementary, because the main content is not part of the template. It has been measured that templates represent between 40% and 50% of data on the Web. Identifying templates is thus essential for indexing tasks, because templates usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information is likely to waste resources (storage space, bandwidth, etc.). Similarly, identifying the main content is essential for many information retrieval tasks. In this paper, we present a benchmark suite to test different approaches to template detection and content extraction. The suite is public, and it contains real, heterogeneous webpages that have been labelled so that different techniques can be compared suitably (and automatically).

Authors: Julián Alarte / David Insa / Josep Silva / Salvador Tamarit
Keywords:

2 - Site-Level Template Extraction Based on Hyperlink Analysis (Original Work)

Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugging content into already formatted and prepared pagelets. For the final user, templates are also useful because they provide uniformity and a common look and feel across all webpages. However, from the point of view of crawlers and indexers, templates are an important problem, because they usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information is likely to waste resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks. In this work we propose a novel method for automatic template extraction based on similarity analysis between the DOM trees of a collection of webpages, which are detected using menu information. Our implementation and experiments demonstrate the usefulness of the technique.

Authors: Julián Alarte / David Insa / Josep Silva / Salvador Tamarit
Keywords:
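
As a rough illustration of the idea in entry 2, the sketch below treats a template as the part of the DOM tree shared by several pages of one site. It is not the authors' algorithm: the `Node` type, the top-down positional comparison, and the tag-and-class equality test are all assumptions made for this example, and the step that selects candidate pages from menu hyperlinks is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, simplified DOM node: a tag, a class attribute, and children."""
    tag: str
    cls: str = ""
    children: List["Node"] = field(default_factory=list)


def common_tree(a: Node, b: Node) -> Optional[Node]:
    """Return the subtree shared by a and b, comparing children position by position."""
    if (a.tag, a.cls) != (b.tag, b.cls):
        return None
    shared = []
    for child_a, child_b in zip(a.children, b.children):
        match = common_tree(child_a, child_b)
        if match is not None:
            shared.append(match)
    return Node(a.tag, a.cls, shared)


def extract_template(pages: List[Node]) -> Optional[Node]:
    """Intersect the trees of all pages; what survives is the candidate template."""
    if not pages:
        return None
    template: Optional[Node] = pages[0]
    for page in pages[1:]:
        if template is None:
            break
        template = common_tree(template, page)
    return template


def tags(node: Node) -> List[str]:
    """Flatten a tree into a preorder list of tags, for easy inspection."""
    result = [node.tag]
    for child in node.children:
        result.extend(tags(child))
    return result


# Two toy pages of the same site: same menu and footer, different main content.
page1 = Node("body", children=[
    Node("nav", "menu", [Node("a"), Node("a")]),
    Node("div", "content", [Node("h1"), Node("p"), Node("p")]),
    Node("footer"),
])
page2 = Node("body", children=[
    Node("nav", "menu", [Node("a"), Node("a")]),
    Node("div", "content", [Node("table")]),
    Node("footer"),
])

template = extract_template([page1, page2])
print(tags(template))  # ['body', 'nav', 'a', 'a', 'div', 'footer']
```

The positional comparison is the simplest possible choice and would miss a shared block that shifts position between pages. A real similarity analysis would need a more tolerant tree mapping.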

3 - Mejora del rendimiento de la depuración declarativa mediante expansión y compresión de bucles (Improving the Performance of Declarative Debugging through Loop Expansion and Compression)

One of the main goals in debugging is to minimise the time needed to find bugs. In declarative debugging, this time depends largely on the number of questions the debugger asks the user, so reducing the number of questions generated is a priority. In this work we show that transforming the loops of the program being debugged can have a significant influence on the debugger's performance. Specifically, we introduce two algorithms that expand and compress the internal representation used by declarative debuggers to represent loops. The result is a series of transformations that can be performed automatically before the user takes part in the debugging session and that yield a considerable improvement at very low cost.

Authors: David Insa / Josep Silva / César Tomás
Keywords: Execution tree - Declarative debugging - Loop Expansion - Tree Compression

4 - Behaviour Preservation across Code Versions in Erlang (Previously Published Work)

In any live, non-trivial program, the source code naturally evolves throughout its lifecycle for many reasons, such as the implementation of new functionality, the optimisation of a bottleneck, or the refactoring of an obscure function. Frequently, these code changes affect several functions and modules, so it can be difficult to know whether the correct behaviour of the previous version has been preserved in the new version. In this paper, we address this problem in the context of the Erlang language, where most developers rely on a previously defined test suite to check behaviour preservation. We propose an alternative approach that automatically obtains a test suite specifically focused on comparing the old and new versions of the code. Our test case generation is directed by a sophisticated combination of several existing tools, such as TypEr, CutEr, and PropEr, and it introduces novel ideas such as allowing the programmer to choose one or more expressions of interest that must preserve the behaviour, and recording the sequences of values to which those expressions evaluate. All the presented work has been implemented in an open-source tool that is publicly available on GitHub.

Authors: David Insa / Sergio Pérez Rubio / Josep Silva / Salvador Tamarit
Keywords: Automated regression testing - Code evolution control - Tracing
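
The core idea in entry 4, comparing the sequences of values that an expression of interest takes in the old and new versions of the code, can be sketched as follows. This is an illustration in Python rather than Erlang and does not use TypEr, CutEr, or PropEr: inputs are simply drawn at random, and the functions `sum_old` and `sum_new` and the `trace` callback are hypothetical names invented for the example.

```python
import random
from typing import Callable, List, Optional, Tuple

Trace = List[int]


def sum_old(xs: List[int], trace: Callable[[int], None]) -> int:
    """Old version: accumulates every element."""
    total = 0
    for x in xs:
        total += x
        trace(total)  # expression of interest: the running total
    return total


def sum_new(xs: List[int], trace: Callable[[int], None]) -> int:
    """New version that skips zeros: same result, but a different sequence of values."""
    total = 0
    for x in xs:
        if x == 0:
            continue
        total += x
        trace(total)
    return total


def run_traced(fn, xs: List[int]) -> Tuple[int, Trace]:
    """Run fn on xs and record every value passed to the trace callback."""
    recorded: Trace = []
    result = fn(xs, recorded.append)
    return result, recorded


def find_discrepancy(old, new, trials: int = 200, seed: int = 0) -> Optional[List[int]]:
    """Return an input on which the recorded traces differ, or None if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-3, 3) for _ in range(rng.randint(0, 5))]
        if run_traced(old, xs)[1] != run_traced(new, xs)[1]:
            return xs
    return None


counterexample = find_discrepancy(sum_old, sum_new)
print(counterexample)
```

Both versions return the same sum for every input, so a test suite that only checks return values would not tell them apart. Comparing the recorded sequences does, which is the motivation for tracing expressions of interest.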

5 - Webpage Menu Detection Based on DOM (Previously Published Work)

Web menus are one of the key elements of a website: they provide fundamental information about the topology of the website itself. Menu detection is useful for humans, but also for crawlers and indexers, because the menu provides essential information about the structure and contents of a website. For humans, identifying the main menu of a website is a relatively easy task. For computer tools, however, identifying the menu is not trivial at all and, in fact, it is still a challenging unsolved problem. In this work, we propose a novel method for automatic Web menu detection that works at the level of the DOM.

Authors: Julián Alarte Aleixandre / David Insa / Josep Silva
Keywords: Information retrieval - Menu detection - Web template detection

6 -

7 -

8 - Automatic Testing of Program Slicers

Program slicing is a technique to extract the part of a program (the slice) that influences or is influenced by a set of variables at a given point (the slicing criterion). Computing minimal slices is undecidable in the general case, and obtaining the minimal slice of a given program is normally computationally prohibitive even for very small programs. Therefore, no matter which program slicer we use, in general we cannot be sure that our slices are minimal. This is probably the fundamental reason why no benchmark collection of minimal program slices exists. In this work, we present a method to automatically produce quasi-minimal slices. Using our method, we have produced a suite of quasi-minimal slices for Erlang, which we later manually proved to be minimal. We explain the process of constructing the suite, the methodology and tools that were used, and the results obtained. The suite comes with a collection of Erlang benchmarks together with different slicing criteria and the associated minimal slices.

Authors: Sergio Pérez / Josep Silva / Salvador Tamarit
Keywords: Erlang - Program analysis - Program Slicing - Testing
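
To make the notion of a slice in entry 8 concrete, here is a small sketch of backward slicing over a straight-line program, written in Python rather than Erlang. It follows data dependences only. The statement representation (a defined variable plus the set of variables it reads) and the function name `backward_slice` are assumptions for the example. This is neither the authors' tool nor a method for producing quasi-minimal slices.

```python
from typing import List, Set, Tuple

# Each statement is (defined variable, set of variables it uses).
Statement = Tuple[str, Set[str]]


def backward_slice(program: List[Statement], point: int, variables: Set[str]) -> List[int]:
    """Indices of the statements that may influence `variables` just after statement `point`."""
    relevant = set(variables)
    kept: List[int] = []
    for index in range(point, -1, -1):
        defined, used = program[index]
        if defined in relevant:
            kept.append(index)
            # This statement supplies the value; what it reads becomes relevant instead.
            relevant.discard(defined)
            relevant |= used
    return sorted(kept)


program: List[Statement] = [
    ("a", set()),        # 0: a = 1
    ("b", set()),        # 1: b = 2
    ("c", {"a"}),        # 2: c = a + 1
    ("d", {"b"}),        # 3: d = b * 2
    ("a", {"c"}),        # 4: a = c + 3
    ("e", {"a", "d"}),   # 5: e = a + d
]

# Slicing criterion: variable a after the last statement.
print(backward_slice(program, 5, {"a"}))  # [0, 2, 4]
# Slicing criterion: variable d after the last statement.
print(backward_slice(program, 5, {"d"}))  # [1, 3]
```

With no branches or loops, the data-dependence slice of such a toy program is easy to inspect by hand. Control flow, function calls, and pattern matching are what make slices hard to compute precisely in real programs.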