Plexus: Optimizing Join Approximation for Geo-Distributed Data Analytics

In Proceedings of the 14th ACM Symposium on Cloud Computing (SoCC 2023)

Joel Wolfrath

University of Minnesota, Twin Cities

Abhishek Chandra

University of Minnesota, Twin Cities

Principal Investigator

Abstract

Modern applications increasingly generate and persist data across geo-distributed data centers or edge clusters rather than in a single cloud. This paradigm introduces challenges for traditional query execution because transferring data over wide-area network links increases latency. Join queries in particular are heavily affected, because of their large output size and the amount of data that must be shuffled over the network. Join sampling, which computes a uniform sample from the join results, is a useful technique for reducing resource requirements. However, applying it to a geo-distributed setting is challenging, since acquiring independent samples from each location and joining on the samples does not produce uniform and independent tuples from the join result. To address these challenges, we first generalize an existing join sampling algorithm to the geo-distributed setting. We then present our system, Plexus, which introduces three additional optimizations to further reduce the network overhead and handle network and data heterogeneity: (i) weight approximation, (ii) heterogeneity awareness, and (iii) sample prefetching. We evaluate Plexus on a geo-distributed system deployed across multiple AWS regions, with an implementation based on Apache Spark. Using three real-world datasets, we show that Plexus can reduce query latency by up to 80% over the default Spark join implementation on a wide class of join queries without substantially impacting sample uniformity.
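
To illustrate why joining independent per-table samples differs from sampling the join result, here is a minimal single-machine sketch in Python. It is not the Plexus algorithm or its Spark implementation; it shows the generic weighted approach for a two-table equi-join, where each tuple of one table is drawn with probability proportional to its number of matches in the other. The function names, the (key, payload) tuple layout, and the toy data are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_then_join(r, s, fraction, rng):
    """Naive approach: sample each table independently, then join the samples.

    The output size is random and the output tuples are correlated, because
    one sampled tuple can appear in many join results.
    """
    r_sample = [t for t in r if rng.random() < fraction]
    s_sample = [t for t in s if rng.random() < fraction]
    return [(rk, rp, sp) for rk, rp in r_sample for sk, sp in s_sample if rk == sk]


def weighted_join_sample(r, s, n, rng):
    """Draw n tuples uniformly, with replacement, from the equi-join of r and s.

    Each r tuple is weighted by its number of matching s tuples, so every
    join result (r_tuple, s_tuple) is equally likely to be drawn.
    """
    # Index s by join key (hypothetical layout: (key, payload) pairs).
    s_by_key = defaultdict(list)
    for key, payload in s:
        s_by_key[key].append(payload)

    # Weight of an r tuple = number of join results it participates in.
    weights = [len(s_by_key[key]) for key, _ in r]
    if sum(weights) == 0:
        return []  # empty join, nothing to sample

    picks = rng.choices(r, weights=weights, k=n)
    # For each picked r tuple, choose one of its matches uniformly.
    return [(key, rp, rng.choice(s_by_key[key])) for key, rp in picks]


# Toy data: key 1 accounts for 3 of the 4 join results.
R = [(1, "a"), (2, "b")]
S = [(1, "x"), (1, "y"), (1, "z"), (2, "w")]

rng = random.Random(0)
uniform_sample = weighted_join_sample(R, S, 4000, rng)
naive_sample = sample_then_join(R, S, 0.5, rng)
```

In this sketch the weights require knowing, for every key, how many matches exist in the other table. In a geo-distributed deployment those counts live at different sites, which is the kind of network cost the abstract refers to.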

Website made by Kanishk Kacholia