AggFirstJoin: Optimizing Geo-Distributed Joins using Aggregation-Based Transformations
In Proceedings of the 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID 2023)
Dhruv Kumar
University of Minnesota, Twin Cities
Sohaib Ahmad
University of Massachusetts, Amherst
Abhishek Chandra
University of Minnesota, Twin Cities
Principal Investigator
Ramesh Sitaraman
University of Massachusetts, Amherst
Principal Investigator
Abstract
Geo-distributed analytics (GDA) processes data stored across geographically distributed sites, which requires transferring data over wide area network (WAN) links. WAN links are highly constrained and heterogeneous, making such transfers slow and costly. To tackle this issue, recent approaches have proposed WAN-aware scheduling and placement of geo-distributed analytics tasks. However, computing joins in a geo-distributed setting remains challenging. In this work, we propose AggFirstJoin, an approach that minimizes the cost of geo-distributed joins using a theoretically sound query transformation technique. Our optimization takes a combined view of the join and aggregation operations, which are often part of the same query, and pushes a transformed aggregation before the join such that the query produces the same results as the original. We augment the query transformation with WAN-aware task placement and Bloom filtering to further reduce query execution time and WAN usage, respectively. We implement our technique on top of Apache Spark, a popular engine for big data analytics. We extensively evaluate it using synthetic, TPC-H, and AMPLab Big Data Benchmark datasets on a real geo-distributed testbed on AWS as well as an emulated testbed. Our evaluations show that our technique achieves up to a 300x reduction in query execution time and a 200x reduction in WAN usage compared to state-of-the-art GDA techniques.
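To give a feel for why aggregating before a join can return the same answer while moving less data, here is a minimal sketch in plain Python. It is not the AggFirstJoin implementation or its exact transformation. It illustrates the general idea for one simple case, a SUM grouped by the join key, and the function names and toy tables are hypothetical. Each site reduces its table to one row per key (a partial sum on one side, a row count on the other), so only those per-key summaries would need to cross the WAN.

```python
from collections import defaultdict

# Toy query (illustrative assumption, not taken from the paper):
#   SELECT R.key, SUM(R.val) FROM R JOIN S ON R.key = S.key GROUP BY R.key
# R rows are (key, val); S rows are (key, payload).

def join_then_aggregate(r_rows, s_rows):
    """Baseline: materialize every joined pair, then sum per key."""
    s_by_key = defaultdict(list)
    for key, payload in s_rows:
        s_by_key[key].append(payload)
    totals = defaultdict(int)
    for key, val in r_rows:
        for _ in s_by_key.get(key, []):  # one joined row per matching S row
            totals[key] += val
    return dict(totals)

def aggregate_then_join(r_rows, s_rows):
    """Transformed plan: pre-aggregate each side per key, then join."""
    r_sum = defaultdict(int)    # partial SUM(val) per key, computed at R's site
    for key, val in r_rows:
        r_sum[key] += val
    s_count = defaultdict(int)  # number of S rows per key, computed at S's site
    for key, _ in s_rows:
        s_count[key] += 1
    # Each R row joins with s_count[key] rows of S, so the joined sum
    # equals the per-key sum of R multiplied by the per-key count of S.
    return {k: r_sum[k] * s_count[k] for k in r_sum if k in s_count}

R = [("a", 1), ("a", 2), ("b", 5), ("c", 7)]
S = [("a", "x"), ("a", "y"), ("b", "z"), ("d", "w")]
baseline = join_then_aggregate(R, S)
transformed = aggregate_then_join(R, S)
```

In this toy case both plans agree on every key, but the transformed plan ships at most one row per distinct key from each site instead of the full tables. Other aggregates and grouping columns need their own transformed forms, which this sketch does not cover.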
Website made by Kanishk Kacholia