The author Julián Alarte has published 3 articles:

1 - A Collection of Website Benchmarks Labelled for Template Detection and Content Extraction

Template detection and content extraction are two of the main areas of information retrieval applied to the Web. They perform different analyses over the structure and content of webpages to extract some part of the document. However, their objectives are different. While template detection identifies the template of a webpage (usually by comparing it with other webpages of the same website), content extraction identifies the main content of the webpage, discarding the rest. Therefore, they are somewhat complementary, because the main content is not part of the template. It has been measured that templates represent between 40% and 50% of the data on the Web. Therefore, identifying templates is essential for indexing tasks, because templates usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information is likely to lead to a waste of resources (storage space, bandwidth, etc.). Similarly, identifying the main content is essential for many information retrieval tasks. In this paper, we present a benchmark suite to test different approaches for template detection and content extraction. The suite is public, and it contains real heterogeneous webpages that have been labelled so that different techniques can be suitably (and automatically) compared. (An illustrative evaluation sketch follows this entry.)

Authors: Julián Alarte / David Insa / Josep Silva / Salvador Tamarit
Keywords:
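
Since the suite is labelled precisely so that techniques can be compared automatically, one natural use is to score an extractor's output against the labelled main content of each page. Below is a minimal sketch assuming a word-level F1 metric and illustrative file names; it is not the suite's actual evaluation protocol.

```python
# Minimal sketch: word-overlap F1 between an extractor's output and the
# labelled main content of a benchmark page. The metric and file names are
# illustrative assumptions, not part of the benchmark suite itself.
from collections import Counter

def word_f1(extracted: str, gold: str) -> float:
    """Word-overlap F1 between extracted text and the labelled main content."""
    ext, ref = extracted.split(), gold.split()
    if not ext or not ref:
        return 0.0
    overlap = sum((Counter(ext) & Counter(ref)).values())  # bag-of-words intersection
    if overlap == 0:
        return 0.0
    precision = overlap / len(ext)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical usage: score one extractor against one labelled page.
# gold = open("benchmark/page_001.main-content.txt").read()
# extracted = my_extractor(open("benchmark/page_001.html").read())
# print(word_f1(extracted, gold))
```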

2 - Site-Level Template Extraction Based on Hyperlink Analysis (Original Work)

Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugging content into already formatted and prepared pagelets. For the final user, templates are also useful, because they provide uniformity and a common look and feel across all webpages. However, from the point of view of crawlers and indexers, templates are an important problem, because they usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information is likely to lead to a waste of resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of the data on the Web. Therefore, identifying templates is essential for indexing tasks. In this work, we propose a novel method for automatic template extraction that is based on similarity analysis between the DOM trees of a collection of webpages that are detected using menu information. Our implementation and experiments demonstrate the usefulness of the technique. (An illustrative sketch of the general idea follows this entry.)

Authors: Julián Alarte / David Insa / Josep Silva / Salvador Tamarit
Keywords:
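
The abstract only names the idea of comparing the DOM trees of webpages reached through the menu. The following is a minimal sketch of that general idea (the DOM structure shared by several pages of the same site approximates the template); it is not the algorithm proposed in the paper, and the use of BeautifulSoup and the helper names are assumptions.

```python
# Minimal sketch, not the paper's algorithm (assumes: pip install beautifulsoup4).
from bs4 import BeautifulSoup

def dom_paths(html: str) -> set:
    """Set of root-to-node tag paths of a webpage's DOM tree."""
    soup = BeautifulSoup(html, "html.parser")
    paths = set()

    def walk(node, prefix):
        for child in node.find_all(recursive=False):  # direct element children only
            path = prefix + "/" + child.name
            paths.add(path)
            walk(child, path)

    walk(soup, "")
    return paths

def shared_template(pages: list) -> set:
    """Tag paths present in every page: a rough proxy for the site template."""
    common = dom_paths(pages[0])
    for html in pages[1:]:
        common &= dom_paths(html)
    return common

# Hypothetical usage with webpages of the same site (e.g. those linked from its menu):
# template = shared_template([open(p).read() for p in ["home.html", "about.html"]])
```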

3 - Page-Level Webpage Menu Detection

Web menus are one of the key elements of a website, since they provide fundamental information about the structure of the website itself. For humans, identifying the main menu of a website is a relatively easy task. However, for computer tools, identifying the menu is not trivial at all and, in fact, it is still a challenging unsolved problem. From the point of view of crawlers and indexers, menu detection is a valuable technique, because processing the menu allows these tools to immediately discover the structure of the website. Identifying the menu is also essential for website mapping tasks: with the information in the menu, it is possible to build a sitemap that includes the main pages without having to follow all the links. In this work, we propose a novel method for automatic Web menu detection that works at the DOM level. Our implementation and experiments demonstrate the usefulness of the technique. (An illustrative heuristic sketch follows this entry.)

Authors: Julián Alarte / David Insa / Josep Silva
Keywords:
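
As a rough illustration of what a DOM-level analysis can look like, the heuristic below ranks list-like containers by how many of their items hold a hyperlink and returns the best-scoring one. It is a simple sketch under stated assumptions, not the method proposed in the paper; BeautifulSoup and the name detect_menu are illustrative choices.

```python
# Simple heuristic sketch, not the paper's method (assumes: pip install beautifulsoup4).
from bs4 import BeautifulSoup

def detect_menu(html: str):
    """Return the DOM element most likely to be the main menu, or None."""
    soup = BeautifulSoup(html, "html.parser")
    best, best_score = None, 0
    for candidate in soup.find_all(["nav", "ul", "ol"]):
        # Direct list items; fall back to any direct children (e.g. inside <nav>).
        items = candidate.find_all("li", recursive=False) or candidate.find_all(recursive=False)
        # Score: how many items contain at least one hyperlink.
        score = sum(1 for item in items if item.find("a", href=True))
        if score > best_score:
            best, best_score = candidate, score
    return best

# Hypothetical usage:
# menu = detect_menu(open("homepage.html").read())
# if menu is not None:
#     print([a["href"] for a in menu.find_all("a", href=True)])
```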