AggFirstJoin: Optimizing Geo-Distributed Joins using Aggregation-Based Transformations

In Proceedings of the 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID 2023)

Dhruv Kumar

University of Minnesota, Twin Cities

Sohaib Ahmad

University of Massachusetts, Amherst

Abhishek Chandra

University of Minnesota, Twin Cities

Principal Investigator

Ramesh Sitaraman

University of Massachusetts, Amherst

Principal Investigator

Abstract

Geo-distributed analytics (GDA) processes data stored across geographically distributed sites, which requires transferring data over wide area network (WAN) links. WAN links are highly constrained and heterogeneous, making such transfers slow and costly. To tackle this issue, recent approaches have proposed WAN-aware scheduling and placement of geo-distributed analytics tasks. However, computing joins in a geo-distributed setting remains a challenging problem. In this work, we propose AggFirstJoin, an approach that minimizes the cost of geo-distributed joins using a theoretically sound query transformation technique. Our optimization takes a combined view of the join and aggregation operations, which are often part of the same query, and pushes a transformed aggregation before the join such that the query produces the same results as the original. We augment the query transformation with WAN-aware task placement to further reduce query execution time and with Bloom filtering to further reduce WAN usage. We implement our technique on top of Apache Spark, a popular engine for big data analytics. We evaluate it extensively using synthetic, TPC-H, and AMPLab Big Data Benchmark datasets on a real geo-distributed testbed on AWS as well as an emulated testbed. Our evaluations show that our technique achieves up to 300x reduction in query execution time and 200x reduction in WAN usage compared to state-of-the-art GDA techniques.
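
The abstract does not spell out the transformation rules, so the following is only a minimal sketch of the general idea of pushing an aggregation before a join, not the AggFirstJoin implementation. It assumes a hypothetical query of the form SELECT R.key, SUM(R.value) FROM R JOIN S ON R.key = S.key GROUP BY R.key, where R and S live at different sites. The table names, column names, and sample rows are all illustrative assumptions. Each site first reduces its table to one row per key (a partial sum for R, a match count for S), so only these small summaries would need to cross the WAN before the join.

```python
from collections import defaultdict

# Hypothetical sample tables: R holds (key, value), S holds (key, payload).
R = [("a", 1), ("a", 2), ("b", 5), ("c", 7)]
S = [("a", "x"), ("a", "y"), ("b", "z"), ("d", "w")]


def join_then_aggregate(r_rows, s_rows):
    """Baseline plan: join R and S on key, then SUM(R.value) per key."""
    s_by_key = defaultdict(list)
    for key, payload in s_rows:
        s_by_key[key].append(payload)

    totals = defaultdict(int)
    for key, value in r_rows:
        # Every matching S row contributes one copy of R.value to the sum.
        for _ in s_by_key.get(key, ()):
            totals[key] += value
    return dict(totals)


def aggregate_then_join(r_rows, s_rows):
    """Transformed plan: aggregate each side per key, then join the summaries."""
    # Site holding R: one partial sum per key.
    r_sum = defaultdict(int)
    for key, value in r_rows:
        r_sum[key] += value

    # Site holding S: one match count per key.
    s_count = defaultdict(int)
    for key, _ in s_rows:
        s_count[key] += 1

    # Join the per-key summaries; each R value appears once per matching S row,
    # so the joined sum is the partial sum scaled by the match count.
    return {key: r_sum[key] * s_count[key] for key in r_sum if key in s_count}


if __name__ == "__main__":
    print(join_then_aggregate(R, S))
    print(aggregate_then_join(R, S))
```

Both plans return the same per-key sums for the sample rows, while the transformed plan exchanges at most one row per key and site instead of the full tables. This particular rewrite relies on SUM being decomposable in this way; other aggregates would need their own transformed form.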

Website made by Kanishk Kacholia