Autor:
Alarte, Julián

Cargando...
Foto de perfil

E-mails conocidos

jualal@doctor.upv.es
jalarte@dsic.upv.es

Fecha de nacimiento

Proyectos de investigación

Unidades organizativas

Puesto de trabajo

Apellidos

Alarte

Nombre de pila

Julián

Nombre

Nombres alternativos

Afiliaciones conocidas

Departamento de Sistemas Informáticos y Computación. Universitat Politècnica de València, Spain
DSIC UPV, Spain
Universitat Politécnica de Valéncia Valencia, Spain

Páginas web conocidas

Página completa del ítem
Notificar un error en este autor

Resultados de la búsqueda

Mostrando 1 - 3 de 3
  • Artículo
    Page-Level Main Content Extraction from Heterogeneous Webpages
    Alarte, Julián; Silva, Josep. Actas de las XX Jornadas de Programación y Lenguajes (PROLE 2021), 2021-09-22.
    The main content of a webpage is often surrounded by other boilerplate elements related to the template, such as menus, advertisements, copyright notices, comments, etc. For crawlers and indexers, isolating the main content from the template and other noisy information is an essential task, because processing and storing noisy information produce a waste of resources such as bandwidth, storage space, computing time, etc. Besides, the detection and extraction of the main content is useful in different areas, such as data mining, web summarization, content adaptation to low resolutions, etc. This work introduces a new technique for main content extraction. In contrast to most techniques, this technique not only extracts text, but also other types of content, such as images, animations, etc. It is a DOM-based page-level technique, thus it only needs to load one single webpage to extract the main content. As a consequence, it is efficient enough as to be used online (in real-time). We have empirically evaluated the technique using a suite of real heterogeneous benchmarks producing very good results compared with other well-known content extraction techniques. Publicado en: ACM Transactions on Knowledge Discovery from Data. Año 2021
  • Artículo
    A Collection of Website Benchmarks Labelled for Template Detection and Content Extraction
    Alarte, Julián; Insa, David; Silva, Josep; Tamarit, Salvador. Actas de las XV Jornadas de Programación y Lenguajes (PROLE 2015), 2015-09-15.
    Template detection and content extraction are two of the main areas of information retrieval applied to the Web. They perform different analyses over the structure and content of webpages to extract some part of the document. However, their objectives are different. While template detection identifies the template of a webpage (usually comparing with other webpages of the same website), content extraction identifies the main content of the webpage discarding the other part. Therefore, they are somehow complementary, because the main content is not part of the template. It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks because templates usually contain irrelevant information such as advertisements, menus and banners. Processing and storing this information is likely to lead to a waste of resources (storage space, bandwidth, etc.). Similarly, identifying the main content is essential for many information retrieval tasks. In this paper, we present a benchmark suite to test different approaches for template detection and content extraction. The suite is public, and it contains real heterogeneous webpages that have been labelled so that different techniques can be suitable (and automatically) compared.
  • Artículo
    Page-Level Webpage Menu Detection
    Alarte, Julián; Insa, David; Silva, Josep. Actas de las XVI Jornadas de Programación y Lenguajes (PROLE 2016), 2016-09-02.
    One of the key elements of a website are Web menus, which provide fundamental information about the structure of the own website. For humans, identifying the main menu of a website is a relatively easy task. However, for computer tools identifying the menu is not trivial at all and, in fact, it is still a challenging unsolved problem. From the point of view of crawlers and indexers, menu detection is a valuable technique, because processing the menu allows these tools to immediately find out the structure of the website. Identifying the menu is also essential for website mapping tasks. With the information of the menu, it is possible to build a sitemap that includes the main pages without having to follow all the links. In this work, we propose a novel method for automatic Web menu detection that works at the level of DOM. Our implementation and experiments demonstrate the usefulness of the technique.