Due to the ubiquitous use of spatial data applications and the large amounts of data they handle, the processing of large-scale distance joins in distributed systems is becoming increasingly popular. Distance Join Queries (DJQs) are important and frequently used operations in numerous applications, including data mining, multimedia and spatial databases. DJQs (e.g., k Nearest Neighbor Join Query, k Closest Pair Query, ε Distance Join Query, etc.) are costly operations, since they involve both a join and a distance-based search, and performing DJQs efficiently is a challenging task. Recent Big Data developments have motivated the emergence of novel technologies for distributed processing of large-scale spatial data in clusters of computers, leading to Distributed Spatial Data Management Systems (DSDMSs). Distributed cluster-based computing systems can be classified as Hadoop-based or Spark-based systems. Based on this classification, in this paper, we compare two of the most recent and leading DSDMSs, SpatialHadoop and LocationSpark, by evaluating the performance of several existing and newly proposed parallel and distributed DJQ algorithms under various settings with large real-world spatial datasets. A general conclusion arising from the execution of the distributed DJQ algorithms studied is that, while SpatialHadoop is a robust and efficient system when large spatial datasets are joined (since it is built on top of the mature Hadoop platform), LocationSpark is the clear winner in total execution time when medium-sized spatial datasets are combined (due to the in-memory processing provided by Spark). However, LocationSpark requires higher memory allocation when large spatial datasets are involved in DJQs (even more so when k and ε are large). Finally, this detailed performance study has demonstrated that the new distributed DJQ algorithms we have proposed are efficient, robust and scalable with respect to different parameters, such as dataset sizes, k, ε and number of computing nodes.
Authors: Francisco Garcia-Garcia / Antonio Corral / Luis Iribarne / Michael Vassilakopoulos / Yannis Manolopoulos
Keywords: Distance Join - LocationSpark - Space Partitioning - Spatial Data Processing - Spatial Query Evaluation - SpatialHadoop
Efficient processing of Distance-Based Join Queries (DBJQs) in spatial databases is of paramount importance in many application domains (e.g., image processing, location-based systems, geographical information systems (GIS), continuous monitoring in streaming data settings, road network systems, etc.). The most representative and best-known DBJQs are the K Closest Pairs Query (KCPQ) and the ε Distance Join Query (εDJQ). These types of join queries are characterized by a number of desired pairs (K) or a distance threshold (ε) between the components of the pairs in the final result, over two spatial datasets. Both are expensive operations, since two spatial datasets are combined with additional constraints, and they become even more costly for large-scale data. Given the increasing volume of spatial data originating from multiple sources and stored in distributed servers, it is not always efficient to perform DBJQs on a centralized server. For this reason, this paper addresses the problem of computing DBJQs on big spatial datasets in SpatialHadoop, an extension of Hadoop-MapReduce that supports efficient processing of spatial queries in a cloud-based setting. SpatialHadoop injects spatial data awareness into each Hadoop layer, i.e., the language, storage, MapReduce and operations layers. We propose novel algorithms, based on plane-sweep, to perform efficient parallel DBJQs on large-scale spatial datasets in SpatialHadoop. In addition to the plane-sweep base technique, we present a methodology for improving the performance of the KCPQ algorithms by computing an upper bound of the distance of the K-th closest pair. To demonstrate the benefits of our proposed methodologies, we present the results of an extensive set of experiments that demonstrate the efficiency and scalability of our proposals using big synthetic and real-world point datasets.
Authors: Antonio Corral / Francisco Garcia-Garcia / Luis Iribarne / Michael Vassilakopoulos / Yannis Manolopoulos
Keywords: εDJQ - KCPQ - MapReduce - Spatial Data Processing - Spatial Query Evaluation - SpatialHadoop
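As an illustration of the two queries named in the preceding abstract, the following is a minimal single-machine sketch of a plane-sweep distance join and a plane-sweep K closest pairs search over 2D points. It is not the authors' SpatialHadoop/MapReduce implementation: the function names `epsilon_distance_join` and `k_closest_pairs` are hypothetical, points are assumed to be `(x, y)` tuples, and the pruning bound used here is simply the running distance of the current K-th best pair, not the precomputed upper bound the abstract refers to.

```python
import heapq
from math import hypot, inf


def epsilon_distance_join(P, Q, eps):
    """Return all pairs (p, q), p in P and q in Q, with dist(p, q) <= eps.

    Plane sweep along the x-axis: both inputs are sorted by x, and for each p
    only the points of Q whose x lies in [p.x - eps, p.x + eps] are examined.
    """
    P, Q = sorted(P), sorted(Q)
    result = []
    start = 0  # first index of Q that can still match the current (or a later) p
    for p in P:
        # Points of Q left of p.x - eps cannot match p or any later point of P.
        while start < len(Q) and Q[start][0] < p[0] - eps:
            start += 1
        j = start
        while j < len(Q) and Q[j][0] <= p[0] + eps:
            q = Q[j]
            if hypot(p[0] - q[0], p[1] - q[1]) <= eps:
                result.append((p, q))
            j += 1
    return result


def k_closest_pairs(P, Q, k):
    """Return the k closest pairs as (distance, p, q) tuples, nearest first."""
    if k <= 0:
        return []
    P, Q = sorted(P), sorted(Q)
    heap = []    # max-heap (negated distances) holding the best k pairs so far
    bound = inf  # distance of the current k-th closest pair; inf until k pairs are found
    start = 0
    for p in P:
        # The bound never grows and p.x never decreases, so this window only moves right.
        while start < len(Q) and Q[start][0] < p[0] - bound:
            start += 1
        for j in range(start, len(Q)):
            q = Q[j]
            if q[0] - p[0] > bound:
                break  # every remaining q is farther than the bound on x alone
            d = hypot(p[0] - q[0], p[1] - q[1])
            if len(heap) < k:
                heapq.heappush(heap, (-d, p, q))
            elif d < bound:
                heapq.heapreplace(heap, (-d, p, q))
            if len(heap) == k:
                bound = -heap[0][0]
    return sorted((-neg_d, p, q) for neg_d, p, q in heap)
```

For example, `epsilon_distance_join(P, Q, 1.5)` returns every cross-dataset pair within distance 1.5, while `k_closest_pairs(P, Q, 3)` returns the three nearest cross-dataset pairs. In a distributed setting, a sweep of this kind would plausibly run inside each partition, with a global step merging the per-partition results; that arrangement is an assumption here, not a detail taken from the abstract.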
SpatialHadoop is an extended MapReduce framework supporting global indexing techniques that partition spatial datasets across several machines and improve spatial query processing performance compared to traditional Hadoop systems. SpatialHadoop supports several spatial operations (e.g., K Nearest Neighbor search, range query, spatial intersection join, etc.) and seven spatial partitioning techniques (Grid, Quadtree, STR, STR+, k-d tree, Z-curve and Hilbert-curve). Distance Join Queries (DJQs), like the K Nearest Neighbors Join Query (KNNJQ) and the K Closest Pairs Query (KCPQ), are common operations used in numerous spatial applications. DJQs are costly operations, since they combine spatial joins with distance-based search. Data partitioning improves the management of large datasets and speeds up query performance. Therefore, performing DJQs efficiently with new partitioning methods in SpatialHadoop is a challenging task. In this paper, a new data partitioning technique based on Voronoi diagrams is designed and implemented in SpatialHadoop. Moreover, improved KNNJQ and KCPQ MapReduce algorithms, using the new partitioning mechanism, are also designed and developed for SpatialHadoop. Finally, the results of an extensive set of experiments with real-world datasets are presented, demonstrating that the new partitioning technique and the improved DJQ MapReduce algorithms are efficient, scalable and robust in SpatialHadoop.
Authors: Francisco Garcia-Garcia / Antonio Corral / Luis Iribarne / Michael Vassilakopoulos
Keywords: Data Partitioning - K Closest Pairs - K Nearest Neighbors Join - MapReduce - Spatial Query Evaluation - SpatialHadoop
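The idea behind partitioning by Voronoi diagrams, mentioned in the preceding abstract, can be sketched in a few lines: given a set of pivot points, each data point is assigned to the cell of its nearest pivot. This is only a minimal single-machine illustration. The function name `voronoi_partition` is hypothetical, the pivots are taken as given (the abstract does not say how they are selected), and nothing here reflects the actual SpatialHadoop implementation.

```python
from collections import defaultdict
from math import hypot


def voronoi_partition(points, pivots):
    """Group 2D points by the index of their nearest pivot (their Voronoi cell).

    Ties are broken in favour of the pivot with the lowest index.
    Returns a dict mapping pivot index -> list of points in that cell.
    """
    cells = defaultdict(list)
    for p in points:
        nearest = min(
            range(len(pivots)),
            key=lambda i: hypot(p[0] - pivots[i][0], p[1] - pivots[i][1]),
        )
        cells[nearest].append(p)
    return dict(cells)


# Usage: two pivots split four points into two cells.
pivots = [(0, 0), (10, 10)]
points = [(1, 1), (9, 9), (2, 0), (8, 10)]
cells = voronoi_partition(points, pivots)
```

Each resulting cell could then serve as one partition, so that points that are close in space tend to be processed together. How cells map to blocks or tasks in practice is left open in this sketch.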