Efficient distributed algorithms for distance join queries in Spark-based spatial analytics systems





Published in

Proceedings of the XXVII Jornadas de Ingeniería del Software y Bases de Datos (JISBD 2023)

Creative Commons License


Apache Sedona (formerly GeoSpark) is a recent in-memory cluster computing system for processing large-scale spatial data. It extends the core of Apache Spark to support spatial datatypes, spatial partitioning techniques, spatial indexes, and spatial operations (e.g., spatial range, nearest neighbor, and spatial join queries). It is under active development at the Apache Software Foundation and has recently graduated to an Apache Top-Level Project. Other Spark-based spatial analytics systems have also been proposed in the literature, such as Simba and LocationSpark, but they have not been updated for a long time. Distance-based Join Queries (DJQs), such as the k nearest neighbor join query (kNNJQ) and the k closest pairs query (kCPQ), are used in numerous spatial applications (e.g., GIS, location-based systems, and continuous monitoring streaming systems), but they are not supported by Apache Sedona. Therefore, in this paper, we investigate how to design and implement efficient distributed DJQ algorithms in Apache Sedona, using the most appropriate spatial partitioning, spatial indexing, and other optimization techniques (e.g., repartitioning and less data). We present the results of an extensive set of experiments with real-world datasets, demonstrating that the proposed kNNJQ and kCPQ distributed algorithms in Apache Sedona are efficient (in terms of total execution time and memory requirements), scalable (with respect to the value of k, the sizes of the datasets, and the number of executors), and robust. Moreover, we have experimentally compared Apache Sedona, LocationSpark, and Simba. Apache Sedona shows the best performance for kCPQ in all cases, and for kNNJQ when the joined datasets are medium-sized, whereas LocationSpark is the winner for kNNJQ when the combined datasets are large, and Simba shows the lowest performance in all considered cases. In summary, Apache Sedona shows the best performance for kCPQ and competitive results for kNNJQ.
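
To make the two query types concrete, the following is a minimal single-machine sketch of what kCPQ and kNNJQ compute. It is a brute-force reference, not the distributed algorithms proposed in the paper, and it does not use the Apache Sedona API. The function names (`k_closest_pairs`, `knn_join`), the sample points, and the use of planar Euclidean distance are assumptions made for illustration.

```python
import heapq
from math import hypot


def dist(p, q):
    """Planar Euclidean distance between two (x, y) points (an assumption)."""
    return hypot(p[0] - q[0], p[1] - q[1])


def k_closest_pairs(P, Q, k):
    """kCPQ: the k pairs (p, q) from P x Q with the smallest distances.

    Brute force over all |P| * |Q| pairs; returns (distance, p, q) tuples.
    """
    pairs = ((dist(p, q), p, q) for p in P for q in Q)
    return heapq.nsmallest(k, pairs)


def knn_join(P, Q, k):
    """kNNJQ: for every point p in P, its k nearest neighbors in Q."""
    return {p: heapq.nsmallest(k, Q, key=lambda q: dist(p, q)) for p in P}


# Hypothetical sample data.
P = [(0.0, 0.0), (5.0, 5.0)]
Q = [(1.0, 0.0), (4.0, 5.0), (10.0, 10.0)]

closest = k_closest_pairs(P, Q, 2)
neighbors = knn_join(P, Q, 1)
print(closest)
print(neighbors)
```

Note that kCPQ returns k pairs in total, while kNNJQ returns k neighbors for each point of the first dataset, so kNNJQ is asymmetric in its two inputs. The brute-force cost of |P| × |Q| distance computations is what spatial partitioning and indexing are meant to avoid at scale.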


About García-García, Francisco

Keywords

K Nearest Neighbor Join Query, K Closest Pairs Query, Distributed Spatial Data Processing, Apache Sedona, LocationSpark, Simba, Spatial Query Evaluation