Search results for Information retrieval

Page-Level Main Content Extraction from Heterogeneous Webpages

The main content of a webpage is often surrounded by boilerplate elements that belong to the template, such as menus, advertisements, copyright notices, and comments. For crawlers and indexers, isolating the main content from the template and other noisy information is an essential task, because processing and storing noisy information wastes resources such as bandwidth, storage space, and computing time. In addition, detecting and extracting the main content is useful in several areas, such as data mining, web summarization, and content adaptation to low resolutions. This work introduces a new technique for main content extraction. In contrast to most techniques, it extracts not only text but also other types of content, such as images and animations. It is a DOM-based page-level technique, so it only needs to load a single webpage to extract the main content. As a consequence, it is efficient enough to be used online (in real time). We have empirically evaluated the technique using a suite of real heterogeneous benchmarks, producing very good results compared with other well-known content extraction techniques.

Published in: ACM Transactions on Knowledge Discovery from Data. Year 2021

Authors: Julián Alarte / Josep Silva
Keywords: Block Detection - Content Extraction - Information retrieval - Template Extraction - Web Mining

Aggregation-based information retrieval system for geospatial data catalogs

Contribution type: Relevant article
Authors: Lacasta, Javier; Lopez-Pellicer, F. Javier; Espejo-García, Borja; Nogueras-Iso, Javier; Zarazaga-Soria, F. Javier.
Title: Aggregation-based information retrieval system for geospatial data catalogs.
Publication: International Journal of Geographical Information Science, 31(8), pp. 1583–1605, 2017. ISSN 1365-8816
DOI: 10.1080/13658816.2017.1319949
Quality indicators (impact factor): JCR-SCI 2016 impact factor: 2.502. Rank: Q2 (46/146) in Computer Science, Information Systems; Q2 (19/49) in Geography, Physical

Authors: Javier Lacasta / Francisco J. Lopez-Pellicer / Borja Espejo García / Javier Nogueras Iso / F. Javier Zarazaga-Soria
Keywords: Catalog Service for the Web - Geospatial Data Catalog - Information retrieval - Spatial Data Infrastructure

Webpage Menu Detection Based on DOM (previously published work)

Web menus are one of the key elements of a website: they provide fundamental information about the topology of the website itself. Menu detection is useful for humans, but also for crawlers and indexers, because the menu provides essential information about the structure and contents of a website. For humans, identifying the main menu of a website is a relatively easy task. For software tools, however, identifying the menu is far from trivial and, in fact, remains a challenging unsolved problem. In this work, we propose a novel method for automatic Web menu detection that works at the DOM level.

Authors: Julián Alarte Aleixandre / David Insa / Josep Silva
Keywords: Information retrieval - Menu detection - Web template detection
