Búsqueda avanzada

Improving Distance-Join Query Processing with Voronoi-Diagram based Partitioning in SpatialHadoop

SpatialHadoop is an extended MapReduce framework supporting global indexing techniques that partition spatial datasets across several machines and improve spatial query processing performance compared to traditional Hadoop systems. SpatialHadoop supports several spatial operations (e.g.,K Nearest Neighbor search, range query, spatial intersection join, etc.) and seven spatial partitioning techniques (Grid, Quadtree, STR, STR+ACs, k-d tree, Z-curve and Hilbert-curve). Distance-Join Queries (DJQs), like the K Nearest Neighbors Join Query (KNNJQ) and K Closest Pairs Query (KCPQ), are common operations used in numerous spatial applications. DJQs are costly operations, since they combine spatial joins with distance-based search. Data partitioning improves the management of large datasets and speeds up query performance.Therefore, performing DJQs efficiently with new partitioning methods in SpatialHadoop is a challenging task. In this paper, a new data partitioning technique based on Voronoi-Diagrams is designed and implemented in SpatialHadoop. Moreover, improved KNNJQ and KCPQ MapReduce algorithms, using the new partitioning mechanism, are also designed and developed for SpatialHadoop. Finally, the results of an extensive set of experiments with real-world datasets are presented, demonstrating that the new partitioning technique and the improved DJQ MapReduce algorithms are efficient, scalable and robust in SpatialHadoop.

Automatic Testing of Design Faults in MapReduce Applications

New processing models are being adopted in Big Data engineering to overcome the limitations of traditional technology. Among them, MapReduce stands out by allowing for the processing of large volumes of data over a distributed infrastructure that can change during runtime. The developer only designs the functionality of the program and its execution is managed by a distributed system. As a consequence, a program can behave differently at each execution because it is automatically adapted to the resources available at each moment. Therefore, when the program has a design fault, this could be revealed in some executions and masked in others. However, during testing, these faults are usually masked because the test infrastructure is stable, and they are only revealed in production because the environment is more aggressive with infrastructure failures, among other reasons. This paper proposes new testing techniques that aimed to detect these design faults by simulating different infrastructure configurations. The testing techniques generate a representative set of infrastructure configurations that as whole are more likely to reveal failures using random testing, and partition testing together with combinatorial testing. The techniques are automated by using a test execution engine called MRTest that is able to detect these faults using only the test input data, regardless of the expected output. Our empirical evaluation shows that MRTest can automatically detect these design faults within a reasonable time.

Efficient Large-scale Distance-Based Join Queries in SpatialHadoop

Efficient processing of Distance-Based Join Queries (DBJQs) in spatial databases is of paramount importance in many application domains (e.g. image processing, location-based systems, geographical information systems (GIS), continuous monitoring in streaming data settings, road network systems, etc.). The most representative and known DBJQs are the K Closest Pairs Query (KCPQ) and the e Distance Join Query (eDJQ). These types of join queries are characterized by a number of desired pairs (K) or a distance threshold (e) between the components of the pairs in the nal result, over two spatial datasets. Both are expensive operations, since two spatial datasets are combined with additional constraints, and they become even more costly operations for large-scale data. Given the increasing volume of spatial data originating from multiple sources and stored in distributed servers, it is not always efficient to perform DBJQs on a centralized server. For this reason, this paper addresses the problem of computing DBJQs on big spatial datasets in SpatialHadoop, an extension of Hadoop-MapReduce that supports efficient processing of spatial queries in a cloud-based setting. SpatialHadoop injects spatial data awareness in each Hadoop layer, i.e. language, storage, MapReduce and operations layers.We propose novel algorithms, based on plane-sweep, to perform efficient parallel DBJQs on large-scale spatial datasets in SpatialHadoop. In addition to the plane-sweep base technique, we present a methodology for improving the performance of the KCPQ algorithms by the computation of an upper bound of the distance of the K-th closest pair. To demonstrate the benets of our proposed methodologies, we present the results of the execution of an extensive set of experiments that demonstrate the efficiency and scalability of our proposals using big synthetic and real-world points datasets.

Automatización de la localización de defectos en el diseño de aplicaciones MapReduce

Los programas MapReduce analizan grandes cantidades de datos sobre una infraestructura distribuida. En cambio, estos programas pueden desarrollarse independientemente de la infraestructura ya que un framework gestiona automáticamente la asignación de recursos y la gestión de fallos. Una vez que se detecta un defecto, suele ser complicado localizar su causa raíz ya que diversas funciones se ejecutan simultáneamente en una infraestructura distribuida que cambia continuamente y que es difícil tanto de controlar como depurar. En este artículo se describe una técnica que, a partir de un caso de prueba que produce fallo, localiza su causa raíz analizando dinámicamente las características del diseño que se cubren cuando se produce fallo y aquellas que no.

Localización de defectos en aplicaciones MapReduce

Los programas que analizan grandes cantidades de datos suelen ejecutarse en entornos distribuidos, tal y como ocurre con las aplicaciones MapReduce. Estos programas se desarrollan independientemente de la infraestructura sobre que la que se ejecutan, ya que un framework gestiona automáticamente la asignación de recursos y gestión de fallos, entre otros. Detectar y localizar defectos en estos programas suele ser una tarea compleja ya que diversas funciones se ejecutan simultáneamente en una infraestructura distribuida, difícil de controlar y que cambia continuamente. En este artículo se describe una técnica que, a partir de un fallo detectado en las pruebas, localiza defectos de diseño analizando dinámicamente los parámetros que lo causan.

Pruebas funcionales en programas MapReduce basadas en comportamientos no esperados

MapReduce es un paradigma de programación que permite el procesamiento paralelo de grandes cantidades de datos. Los programas MapReduce se suelen ejecutar sobre el framework Hadoop, el cual no garantiza que se ejecuten siempre en las mismas condiciones, pudiendo producir comportamientos no esperados desde el punto de vista de su funcionalidad. En este artículo se analizan y describen diferentes tipos de defectos específicos que pueden estar presentes en programas MapReduce sobre Hadoop y se muestra cómo se pueden derivar casos de prueba que permiten la detección de dichos defectos. Lo anterior se ilustra sobre varios programas de ejemplo.

Pruebas basadas en flujo de datos para programas MapReduce

MapReduce es un paradigma de procesamiento masivo de información. Estos programas realizan varias transformaciones de los datos hasta que se obtiene la salida representando la lógica de negocio del programa. En este artículo se elabora una técnica de prueba basada en data flow y que deriva las pruebas a partir de las transformaciones que ocurren en el programa. Se muestran resultados de la ejecución de los casos de prueba derivados de la aplicación de la técnica, los cuales permiten detectar algunos defectos.