Author:
Insa, David

Known e-mail addresses
dinsa@dsic.upv.es
Surname
Insa
First name
David
Known affiliations
Departamento de Sistemas Informáticos y Computación. Universitat Politècnica de València, Spain
Universitat Politècnica de València, Spain
Universitat Politècnica de València, Valencia, Spain

Search results

Showing 1 - 3 of 3
  • Article
    Page-Level Webpage Menu Detection
    Alarte, Julián; Insa, David; Silva, Josep. Actas de las XVI Jornadas de Programación y Lenguajes (PROLE 2016), 2016-09-02.
    Web menus are one of the key elements of a website: they provide fundamental information about the structure of the website itself. For humans, identifying the main menu of a website is a relatively easy task. For computer tools, however, identifying the menu is not trivial at all and, in fact, it remains a challenging unsolved problem. From the point of view of crawlers and indexers, menu detection is a valuable technique, because processing the menu allows these tools to immediately discover the structure of the website. Identifying the menu is also essential for website mapping tasks: with the information in the menu, it is possible to build a sitemap that includes the main pages without having to follow all the links. In this work, we propose a novel method for automatic Web menu detection that works at the DOM level. Our implementation and experiments demonstrate the usefulness of the technique.
  • Article
    How to construct a suite of program slices
    Insa, David; Pérez Rubio, Sergio; Silva, Josep. Actas de las XVI Jornadas de Programación y Lenguajes (PROLE 2016), 2016-09-02.
    Program slicing is a technique to extract the part of a program (the slice) that influences or is influenced by a set of variables at a given point. Computing minimal slices is undecidable in the general case, and obtaining the minimal slice of a given program is computationally prohibitive even for very small programs. Hence, no matter what program slicer we use, in general we cannot be sure that our slices are minimal. This is probably the fundamental reason why no benchmark collection of minimal program slices exists, even though one would be of great interest. In this work, we present the first suite of quasi-minimal slices (i.e., we cannot prove that they are minimal, but we provide technical evidence, based on different techniques, that they probably are). We explain the process of constructing the suite, the methodology and tools that were used, and the results obtained. The suite comes with a collection of Erlang benchmarks together with different slicing criteria and the associated quasi-minimal slices. It can be used to evaluate and compare program slicers, but it is particularly useful for developing slicers, because it contains scripts that automatically validate a slicer against the whole suite. Specifically, these scripts produce reports about the impact on recall and precision of any change made during the development of the slicer.
  • Article
    A Collection of Website Benchmarks Labelled for Template Detection and Content Extraction
    Alarte, Julián; Insa, David; Silva, Josep; Tamarit, Salvador. Actas de las XV Jornadas de Programación y Lenguajes (PROLE 2015), 2015-09-15.
    Template detection and content extraction are two of the main areas of information retrieval applied to the Web. Both perform analyses over the structure and content of webpages to extract some part of the document, but their objectives differ: while template detection identifies the template of a webpage (usually by comparison with other webpages of the same website), content extraction identifies the main content of the webpage, discarding the rest. They are therefore somewhat complementary, because the main content is not part of the template. It has been measured that templates represent between 40% and 50% of the data on the Web, so identifying templates is essential for indexing tasks, because templates usually contain irrelevant information such as advertisements, menus, and banners; processing and storing this information is likely to waste resources (storage space, bandwidth, etc.). Similarly, identifying the main content is essential for many information retrieval tasks. In this paper, we present a benchmark suite to test different approaches to template detection and content extraction. The suite is public, and it contains real, heterogeneous webpages that have been labelled so that different techniques can be suitably (and automatically) compared.
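The DOM-level idea behind "Page-Level Webpage Menu Detection" can be illustrated with a toy heuristic; this sketch is not the authors' algorithm, and the MenuFinder class and the largest-list rule are assumptions made here purely for illustration. It treats the <ul> element containing the most links as a menu candidate:

```python
# Toy DOM-level menu heuristic (an illustration, NOT the paper's method):
# the <ul> with the most direct links is taken as the menu candidate.
from html.parser import HTMLParser

class MenuFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []        # one list of link texts per open <ul>
        self.candidates = []   # link lists of completed <ul> elements
        self.in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.stack.append([])
        elif tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "ul" and self.stack:
            self.candidates.append(self.stack.pop())
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link and self.stack:
            self.stack[-1].append(data.strip())

finder = MenuFinder()
finder.feed("<ul><li><a>Home</a></li><li><a>About</a></li>"
            "<li><a>Contact</a></li></ul>")
menu = max(finder.candidates, key=len)
print(menu)   # -> ['Home', 'About', 'Contact']
```

A real detector would of course weigh many more DOM features (position, repetition across pages, CSS classes); this sketch only shows why working at the DOM level makes the problem tractable.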
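The notion of a backward slice used in "How to construct a suite of program slices" can be made concrete with a minimal sketch over straight-line assignments; the paper's suite targets Erlang programs and real slicers, so the backward_slice function below is a hypothetical simplification for illustration only:

```python
# Minimal backward slicing over straight-line assignments
# (an illustration of the concept, NOT the paper's tooling).
# Each statement is (defined_var, vars_read_on_the_right_hand_side).
def backward_slice(stmts, criterion):
    """Return indices of statements that may influence `criterion`."""
    relevant = {criterion}
    in_slice = []
    for i in range(len(stmts) - 1, -1, -1):   # walk backwards
        defined, used = stmts[i]
        if defined in relevant:
            in_slice.append(i)
            relevant.discard(defined)         # definition found...
            relevant |= set(used)             # ...now its inputs matter
    return sorted(in_slice)

# x = 1; y = 2; z = x + y; w = y * 2
program = [("x", []), ("y", []), ("z", ["x", "y"]), ("w", ["y"])]
print(backward_slice(program, "z"))   # -> [0, 1, 2]  (w is sliced away)
print(backward_slice(program, "w"))   # -> [1, 3]     (x and z are irrelevant)
```

Even in this toy setting the slice is only guaranteed to be *safe*, not minimal, which is exactly the gap the quasi-minimal suite addresses.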
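The premise of "A Collection of Website Benchmarks Labelled for Template Detection and Content Extraction" that a template is the part shared across pages of the same website can be sketched naively; the shared_template function is an illustrative assumption, not the benchmark's labelling procedure:

```python
# Naive template approximation (an illustration, NOT the benchmark's
# labelling method): content that appears in every sibling page of a
# website is treated as template; the remainder is main content.
def shared_template(pages):
    """Return the set of lines common to every page."""
    common = set(pages[0].splitlines())
    for page in pages[1:]:
        common &= set(page.splitlines())
    return common

page_a = "MENU\nHOME | NEWS\nArticle about cats\nFOOTER"
page_b = "MENU\nHOME | NEWS\nArticle about dogs\nFOOTER"
print(sorted(shared_template([page_a, page_b])))
# -> ['FOOTER', 'HOME | NEWS', 'MENU']
```

Real template detectors compare DOM subtrees rather than raw lines, but the intersection-across-pages intuition is the same, and it is why labelled multi-page benchmarks are needed to evaluate them.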