Búsqueda avanzada

El autor David Insa ha publicado 8 artículo(s):

1 - A Collection of Website Benchmarks Labelled for Template Detection and Content Extraction

Template detection and content extraction are two of the main areas of information retrieval applied to the Web. They perform different analyses over the structure and content of webpages to extract some part of the document. However, their objectives are different. While template detection identifies the template of a webpage (usually comparing with other webpages of the same website), content extraction identifies the main content of the webpage discarding the other part. Therefore, they are somehow complementary, because the main content is not part of the template. It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks because templates usually contain irrelevant information such as advertisements, menus and banners. Processing and storing this information is likely to lead to a waste of resources (storage space, bandwidth, etc.). Similarly, identifying the main content is essential for many information retrieval tasks. In this paper, we present a benchmark suite to test different approaches for template detection and content extraction. The suite is public, and it contains real heterogeneous webpages that have been labelled so that different techniques can be suitable (and automatically) compared.

Autores: Julián Alarte / David Insa / Josep Silva / Salvador Tamarit / 
Palabras Clave:

2 - Site-Level Template Extraction Based on Hyperlink Analysis (Original Work)

Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugin content into already formatted and prepared pagelets. For the final user templates are also useful, because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important problem, because templates usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information is likely to lead to a waste of resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks. In this work we propose a novel method for automatic template extraction that is based on similarity analysis between the DOM trees of a collection of webpages that are detected using menus information. Our implementation and experiments demonstrate the usefulness of the technique.

Autores: Julián Alarte / David Insa / Josep Silva / Salvador Tamarit / 
Palabras Clave:

3 - Mejora del rendimiento de la depuración declarativa mediante expansión y compresión de bucles

Uno de los principales objetivos en la depuración es reducir al máximo el tiempo necesario para encontrar los errores. En la depuración declarativa este tiempo depende en gran medida del número de preguntas realizadas al usuario por el depurador y, por tanto, reducir el número de preguntas generadas es un objetivo prioritario. En este trabajo demostramos que transformar los bucles del programa a depurar puede tener una influencia importante sobre el rendimiento del depurador. Concretamente, introducimos dos algoritmos que expanden y comprimen la representación interna utilizada por los depuradores declarativos para representar bucles. El resultado es una serie de transformaciones que pueden realizarse automáticamente antes de que el usuario intervenga en la depuración y que producen una mejora considerable a un coste muy bajo.

Autores: David Insa / Josep Silva / César Tomás / 
Palabras Clave: á rbol de ejecución - Depuración declarativa - Loop Expansion - Tree Compression

4 - How to construct a suite of program slices

Program slicing is a technique to extract the part of a program (the slice) that influences or is influenced by a set of variables at a given point. Computing minimal slices is undecidable in the general case. Obtaining the minimal slice of a given program is computationally prohibitive even for very small programs. Hence, no matter what program slicer we use, in general, we cannot be sure that our slices are minimal. This is probably the fundamental reason why no benchmark collection of minimal program slices exists, even though this would be of great interest. In this work, we present the first suite of quasi-minimal slices (i.e., we cannot prove that they are minimal, but we provide technological evidences, based on different techniques, that they probably are). We explain the process of constructing the suite, the methodology and tools that were used, and the obtained results. The suite comes with a collection of Erlang benchmarks together with different slicing criteria and the associated quasi-minimal slices. This suite can be used to evaluate and compare program slicers, but it is particularly useful to develop slicers, because it contains scripts that allow for automatically validating a slicer against the whole suite. Concretely, these scripts produce reports about the impact on recall and precision of any change done during the development of the slicer.

Autores: David Insa  /  Sergio Pérez  /  Josep Silva / 
Palabras Clave:

5 - Behaviour Preservation across Code Versions in Erlang (Trabajo ya publicado)

In any alive and non-trivial program, the source code naturally evolves along the lifecycle for many reasons such as the implementation of new functionality, the optimisation of a bottle-neck, the refactoring of an obscure function, etc. Frequently, these code changes affect various different functions and modules, so it can be difficult to know whether the correct behaviour of the previous version has been preserved in the new version. In this paper, we face this problem in the context of the Erlang language, where most developers rely on a previously defined test suite to check the behaviour preservation. We propose an alternative approach to automatically obtain a test suite that specifically focusses on comparing the old and new versions of the code. Our test case generation is directed by a sophisticated combination of several already existing tools such as TypEr, CutEr, and PropEr; and it introduces novel ideas such as allowing the programmer to choose one or more expressions of interest that must preserve the behaviour, or the recording of the sequences of values to which those expressions are evaluated. All the presented work has been implemented in an open-source tool that is publicly available on GitHub.

Autores: David Insa / Sergio Pérez Rubio / Josep Silva / Salvador Tamarit / 
Palabras Clave: Automated regression testing - Code evolution control - Tracing

6 - Webpage Menu Detection Based on DOM (Trabajo ya publicado)

One of the key elements of a website is Web menus, which provide fundamental information about the topology of the own website. Menu detection is useful for humans, but also for crawlers and indexers because the menu provides essential information about the structure and contents of a website. For humans, identifying the main menu of a website is a relatively easy task. However, for computer tools identifying the menu is not trivial at all and, in fact, it is still a challenging unsolved problem. In this work, we propose a novel method for automatic Web menu detection that works at the level of DOM.

Autores: Julián Alarte Aleixandre / David Insa / Josep Silva / 
Palabras Clave: Information retrieval - Menu detection - Web template detection

7 -

8 -