Advanced search

The author Josep Silva has published 11 article(s):

1 - A Collection of Website Benchmarks Labelled for Template Detection and Content Extraction

Template detection and content extraction are two of the main areas of information retrieval applied to the Web. They perform different analyses over the structure and content of webpages to extract some part of the document. However, their objectives are different. While template detection identifies the template of a webpage (usually by comparison with other webpages of the same website), content extraction identifies the main content of the webpage, discarding the rest. Therefore, they are somehow complementary, because the main content is not part of the template. It has been measured that templates represent between 40% and 50% of the data on the Web. Therefore, identifying templates is essential for indexing tasks, because templates usually contain irrelevant information such as advertisements, menus and banners. Processing and storing this information is likely to lead to a waste of resources (storage space, bandwidth, etc.). Similarly, identifying the main content is essential for many information retrieval tasks. In this paper, we present a benchmark suite to test different approaches for template detection and content extraction. The suite is public, and it contains real heterogeneous webpages that have been labelled so that different techniques can be suitably (and automatically) compared.

Authors: Julián Alarte / David Insa / Josep Silva / Salvador Tamarit
Keywords:

2 - Site-Level Template Extraction Based on Hyperlink Analysis (Original Work)

Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugging content into already formatted and prepared pagelets. For the final user, templates are also useful, because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important problem, because templates usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information is likely to lead to a waste of resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of the data on the Web. Therefore, identifying templates is essential for indexing tasks. In this work we propose a novel method for automatic template extraction that is based on similarity analysis between the DOM trees of a collection of webpages that are detected using menu information. Our implementation and experiments demonstrate the usefulness of the technique.

Authors: Julián Alarte / David Insa / Josep Silva / Salvador Tamarit
Keywords:
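The paper's actual method detects related pages via hyperlink and menu analysis; as a rough sketch of the underlying idea (DOM-tree similarity between pages of the same site), the following toy Python example intersects the DOM trees of two pages and treats the shared structure as the template. All names here are illustrative, not from the paper.

```python
# Toy sketch: the nodes that two webpages of the same site have in common
# (same tag, same position) are assumed to belong to the template.
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Builds a nested (tag, children) tree from an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.stack = [("root", [])]
    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)
    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()
    def tree(self):
        return self.stack[0]

def shared_structure(a, b):
    """Top-down intersection of two DOM trees."""
    if a[0] != b[0]:
        return None
    children = []
    for ca, cb in zip(a[1], b[1]):
        common = shared_structure(ca, cb)
        if common:
            children.append(common)
    return (a[0], children)

page1 = "<html><body><div>menu</div><p>news A</p></body></html>"
page2 = "<html><body><div>menu</div><span>other</span></body></html>"
t1, t2 = TreeBuilder(), TreeBuilder()
t1.feed(page1); t2.feed(page2)
template = shared_structure(t1.tree(), t2.tree())
# the shared <div> survives; the page-specific <p>/<span> do not
```

A real detector would compare many pages and use structural similarity measures rather than exact positional matching.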

3 - Improving the Performance of Declarative Debugging through Loop Expansion and Compression

One of the main goals in debugging is to minimise the time needed to find bugs. In declarative debugging, this time largely depends on the number of questions the debugger asks the user, so reducing the number of generated questions is a priority. In this work we show that transforming the loops of the program being debugged can have a significant influence on the debugger's performance. Concretely, we introduce two algorithms that expand and compress the internal representation used by declarative debuggers to represent loops. The result is a series of transformations that can be performed automatically before the user takes part in the debugging session, and that produce a considerable improvement at a very low cost.

Authors: David Insa / Josep Silva / César Tomás
Keywords: Execution tree - Declarative debugging - Loop expansion - Tree compression
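A toy illustration (not the paper's algorithms) of why re-shaping loop nodes helps: a loop recorded as a linear chain of iteration nodes can force the debugger to ask one question per iteration, while compressing the chain into a balanced tree lets each answer discard half of the remaining iterations.

```python
# Worst-case number of questions needed to locate a buggy loop iteration,
# comparing a linear chain of n iteration nodes with a balanced tree.
import math

def questions_linear(n):
    # one question per iteration node in the worst case
    return n

def questions_balanced(n):
    # binary search over iterations: one question per tree level
    return max(1, math.ceil(math.log2(n))) if n > 0 else 0

for n in (8, 1024):
    assert questions_balanced(n) < questions_linear(n)
```

For 1024 iterations the balanced representation needs at most 10 questions instead of 1024, which matches the abstract's point that the transformation pays off at a very low cost.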

4 - Behaviour Preservation across Code Versions in Erlang (Previously published work)

In any live, non-trivial program, the source code naturally evolves along the lifecycle for many reasons, such as the implementation of new functionality, the optimisation of a bottleneck, the refactoring of an obscure function, etc. Frequently, these code changes affect several different functions and modules, so it can be difficult to know whether the correct behaviour of the previous version has been preserved in the new version. In this paper, we face this problem in the context of the Erlang language, where most developers rely on a previously defined test suite to check behaviour preservation. We propose an alternative approach to automatically obtain a test suite that specifically focuses on comparing the old and new versions of the code. Our test case generation is directed by a sophisticated combination of several existing tools, such as TypEr, CutEr, and PropEr; and it introduces novel ideas, such as allowing the programmer to choose one or more expressions of interest whose behaviour must be preserved, or the recording of the sequences of values to which those expressions evaluate. All the presented work has been implemented in an open-source tool that is publicly available on GitHub.

Authors: David Insa / Sergio Pérez Rubio / Josep Silva / Salvador Tamarit
Keywords: Automated regression testing - Code evolution control - Tracing
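The behaviour-comparison core of this idea can be sketched in a few lines: record the sequence of values that a chosen expression of interest takes in both versions and compare the traces. The real tool works on Erlang and combines TypEr, CutEr, and PropEr; this Python sketch, with hypothetical function names, only illustrates the comparison step.

```python
# Record the value of the expression of interest (here, the return value
# of a function) for every test input, then compare the two versions.
def trace_calls(fn, inputs):
    return [fn(x) for x in inputs]

def sum_to_v1(n):          # old version: loop
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_to_v2(n):          # new version: closed formula
    return n * (n + 1) // 2

inputs = range(0, 50)
assert trace_calls(sum_to_v1, inputs) == trace_calls(sum_to_v2, inputs)
```

A failing comparison would pinpoint the input and the expression where the new version diverges from the old one.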

5 - Webpage Menu Detection Based on DOM (Previously published work)

One of the key elements of a website is its Web menu, which provides fundamental information about the topology of the website itself. Menu detection is useful for humans, but also for crawlers and indexers, because the menu provides essential information about the structure and contents of a website. For humans, identifying the main menu of a website is a relatively easy task. However, for computer tools, identifying the menu is not trivial at all and, in fact, it remains a challenging unsolved problem. In this work, we propose a novel method for automatic Web menu detection that works at the DOM level.

Authors: Julián Alarte Aleixandre / David Insa / Josep Silva
Keywords: Information retrieval - Menu detection - Web template detection
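To make the problem concrete, here is a toy DOM-level heuristic (far simpler than the paper's method): pick the list or navigation element whose contents hold the most links, a crude proxy for "the main menu".

```python
# Toy menu detector: among <ul>/<nav> elements, choose the one that
# directly or indirectly contains the most <a> links.
from html.parser import HTMLParser

class MenuFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []          # open candidates, each as [tag, link_count]
        self.best = ("", 0)      # (tag, links) of the best candidate so far
    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "nav"):
            self.stack.append([tag, 0])
        elif tag == "a" and self.stack:
            self.stack[-1][1] += 1
    def handle_endtag(self, tag):
        if tag in ("ul", "nav") and self.stack:
            t, links = self.stack.pop()
            if links > self.best[1]:
                self.best = (t, links)

html = """<nav><a href=/>Home</a><a href=/docs>Docs</a><a href=/blog>Blog</a></nav>
<ul><a href=/x>one link</a></ul>"""
f = MenuFinder()
f.feed(html)
# f.best now names the element with the most links
```

This naive counter is easily fooled (e.g. by link farms in footers), which is exactly why the abstract calls menu detection a challenging unsolved problem.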

6 -

7 -

8 - Automatic Testing of Program Slicers

Program slicing is a technique to extract the part of a program (the slice) that influences or is influenced by a set of variables at a given point (the slicing criterion). Computing minimal slices is undecidable in the general case, and obtaining the minimal slice of a given program is normally computationally prohibitive even for very small programs. Therefore, no matter what program slicer we use, in general, we cannot be sure that our slices are minimal. This is probably the fundamental reason why no benchmark collection of minimal program slices exists. In this work, we present a method to automatically produce quasi-minimal slices. Using our method, we have produced a suite of quasi-minimal slices for Erlang that we have later manually proved to be minimal. We explain the process of constructing the suite, the methodology and tools that were used, and the results obtained. The suite comes with a collection of Erlang benchmarks together with different slicing criteria and the associated minimal slices.

Authors: Sergio Pérez / Josep Silva / Salvador Tamarit
Keywords: Erlang - Program analysis - Program slicing - Testing
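To make the notion of a backward slice concrete, the following minimal Python sketch slices straight-line code via def-use chains. Real slicers, such as the Erlang ones this suite benchmarks, must also handle control flow, function calls, side effects, etc.; the statement encoding here is purely illustrative.

```python
# Backward slice over straight-line code: keep every statement whose
# defined variable is (transitively) used to compute the criterion.
def backward_slice(stmts, criterion):
    """stmts: list of (defined_var, used_vars); criterion: variable name.
    Returns the indices of statements that may influence the criterion."""
    needed, kept = {criterion}, []
    for i in range(len(stmts) - 1, -1, -1):
        var, uses = stmts[i]
        if var in needed:
            needed.discard(var)
            needed.update(uses)
            kept.append(i)
    return sorted(kept)

program = [
    ("a", []),        # 0: a = 1
    ("b", []),        # 1: b = 2
    ("c", ["a"]),     # 2: c = a + 1
    ("d", ["b"]),     # 3: d = b * 2
    ("e", ["c"]),     # 4: e = c
]
assert backward_slice(program, "e") == [0, 2, 4]
```

Even in this tiny setting, proving that a slice is minimal requires checking that no kept statement can be removed without changing the criterion's value, which hints at why minimality is prohibitive in general.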

9 - Object Variable Dependencies in Object-Oriented Programs

The System Dependence Graph (SDG) is a program representation that is the basis of program slicing, a technique that extracts the part of the program that can affect the values computed at a given program point (known as the slicing criterion). Several approaches have enhanced the SDG representation to deal with object-oriented programs, the most advanced being the Java System Dependence Graph (JSysDG). In this paper, we show that even the JSysDG does not produce complete slices in all cases when some object variables are selected as the slicing criterion. We then extend the JSysDG with a specific flow dependence for object-type variables called object-flow dependence. This extension provides a more accurate flow representation between object variables and their data members, and allows us to obtain complete slices when an object variable is chosen as the slicing criterion.

Authors: Carlos Galindo / Sergio Perez / Josep Silva
Keywords: Flow dependence - JSysDG - Object-flow dependence - Program slicing
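A small illustration (written in Python for brevity; the paper works on Java and the JSysDG formalism) of the completeness problem: when the criterion is the object variable itself, statements that only mutate its data members must still be kept in the slice.

```python
# The object variable `a` is never reassigned, yet `deposit(a, 100)`
# changes its observable state; a slicer without object-flow dependence
# may drop that call, producing an incomplete slice.
class Account:
    def __init__(self):
        self.balance = 0

def deposit(acc, amount):
    acc.balance += amount   # data-member flow: acc itself is not reassigned

a = Account()
deposit(a, 100)
# Criterion: the object variable `a`. Without the call above, the slice
# would leave a.balance == 0 instead of 100.
assert a.balance == 100
```

Object-flow dependence, as described in the abstract, is precisely the extra edge that links an object variable to assignments of its data members so such statements survive slicing.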

10 - Conditional Control Dependence to Represent Catch Statements in the System Dependence Graph

Program slicing is a technique for program analysis and transformation with many different applications, such as program debugging, program specialization, and parallelization. The system dependence graph (SDG) is the most commonly used data structure for program slicing. In this paper, we show that the presence of exception-handling constructs can make the SDG produce incorrect and sometimes even incomplete slices. We showcase instances of incorrectness and incompleteness, and we propose a new kind of control dependence, conditional control dependence, which produces more precise slices in the presence of catch statements.

Authors: Carlos Galindo / Sergio Perez Rubio / Josep Silva
Keywords: Conditional control dependence - Control dependence - Exception handling - Program slicing - System dependence graph
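A small illustration (in Python, not the paper's SDG formalism) of why catch blocks complicate slicing: whether the handler runs depends on whether earlier statements raise, so a slicer that drops a statement which never assigns the criterion variable can still change which value reaches it.

```python
# The division never assigns y, yet it decides whether the except branch
# (and hence y's final value) is taken.
def original(x):
    try:
        _ = 10 // x       # may raise ZeroDivisionError; does not assign y
        y = 1
    except ZeroDivisionError:
        y = -1
    return y              # slicing criterion: y

def incomplete_slice(x):
    # dropping the division (it "does not affect y") changes the behaviour:
    try:
        y = 1
    except ZeroDivisionError:
        y = -1
    return y

assert original(0) == -1
assert incomplete_slice(0) == 1   # incomplete: the raising statement was sliced away
```

Conditional control dependence, as proposed in the abstract, records that the catch body is control-dependent on the potentially raising statements, so they are kept in the slice.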

11 - Page-Level Main Content Extraction from Heterogeneous Webpages

The main content of a webpage is often surrounded by other boilerplate elements related to the template, such as menus, advertisements, copyright notices, comments, etc. For crawlers and indexers, isolating the main content from the template and other noisy information is an essential task, because processing and storing noisy information produce a waste of resources such as bandwidth, storage space, computing time, etc. Besides, the detection and extraction of the main content is useful in different areas, such as data mining, web summarization, content adaptation to low resolutions, etc. This work introduces a new technique for main content extraction. In contrast to most techniques, it extracts not only text but also other types of content, such as images, animations, etc. It is a DOM-based, page-level technique, so it only needs to load one single webpage to extract the main content. As a consequence, it is efficient enough to be used online (in real time). We have empirically evaluated the technique using a suite of real heterogeneous benchmarks, producing very good results compared with other well-known content extraction techniques. Published in: ACM Transactions on Knowledge Discovery from Data, 2021.

Authors: Julián Alarte / Josep Silva
Keywords: Block detection - Content extraction - Information retrieval - Template extraction - Web mining
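As a rough, page-level sketch of the kind of reasoning involved (the paper's technique is far more complete and also extracts images, animations, etc.), the following toy heuristic picks, among the blocks of a single page, the one with the most word content, a crude stand-in for text-density measures used in the literature.

```python
# Toy main-content selector: choose the page block with the most words.
import re

def main_block(blocks):
    """blocks: list of (label, text) pairs extracted from one page.
    Returns the label of the block with the highest word count."""
    return max(blocks, key=lambda b: len(re.findall(r"\w+", b[1])))[0]

page = [
    ("menu", "Home About Contact"),
    ("article", "Template detection and content extraction are two of the "
                "main areas of information retrieval applied to the Web."),
    ("footer", "Copyright 2021"),
]
assert main_block(page) == "article"
```

Because it inspects a single page, this heuristic is page-level in the abstract's sense: no other pages of the site need to be downloaded.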