DCSG
Distributed Computing Systems Group

Geo-distributed MapReduce

Motivation

Data are increasingly generated and stored in a geographically distributed manner. Examples range from Web applications such as social media sites and the content delivery networks (CDNs) that serve their content, all the way to distributed sensors in mobile devices such as smartphones and embedded devices such as those for climate monitoring. Each of these uses generates a large volume of data in a geographically distributed manner.

To extract knowledge from this data, many applications need to efficiently analyze and process this geo-distributed data, and often the resources available for this computation are also geographically distributed. As a result, we need to determine how to employ distributed resources to efficiently process geo-distributed data.

Challenges

Image of purely centralized approach

At one extreme, distributed data inputs can be sent to a single central data center for processing, but this can be prohibitively slow, or excessively costly in terms of bandwidth and of the capacity needed to hold all of the data in one location.

Image of purely distributed approach

At the other extreme, computation can be mapped onto the input data in situ, yielding intermediate results that typically require further aggregation in order to compute a final result. If this intermediate data volume is larger than the input data volume, then this aggregation might be more costly than moving the data to a central site in the first place.

Indeed, neither of these extremes is the right answer in all cases. Instead, the best approach typically lies somewhere along the continuum between the two extremes.
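
The trade-off between the two extremes can be illustrated with a toy cost model. This is only a sketch under simplifying assumptions that are not part of the project itself: cost is measured purely as bytes crossing the wide area, all aggregation happens at one hypothetical site (`central`), and every site's map output is a fixed multiple (`expansion`) of its input. The function names, site names, and sizes are made up for illustration.

```python
def wan_bytes_centralized(input_sizes, central):
    """Wide-area traffic when every remote site ships its raw input to `central`."""
    return sum(size for site, size in input_sizes.items() if site != central)


def wan_bytes_in_situ(input_sizes, central, expansion):
    """Wide-area traffic when each site maps its own input locally and ships
    the intermediate data (expansion * input) to a reducer at `central`."""
    return sum(expansion * size
               for site, size in input_sizes.items() if site != central)


def cheaper_extreme(input_sizes, central, expansion):
    """Name the extreme that moves fewer bytes over the wide area."""
    centralized = wan_bytes_centralized(input_sizes, central)
    in_situ = wan_bytes_in_situ(input_sizes, central, expansion)
    if centralized == in_situ:
        return "tie"
    return "centralized" if centralized < in_situ else "in situ"


# Hypothetical input sizes in gigabytes, one entry per site.
inputs = {"us": 100, "eu": 40, "asia": 60}

for expansion in (0.1, 1.0, 3.0):
    print(expansion, cheaper_extreme(inputs, "us", expansion))
```

In this model, a map phase that shrinks its input (expansion below 1) favors in situ processing, while one that expands it (expansion above 1) favors moving the raw data first. Real deployments also depend on link bandwidths, compute capacity, and how reducers are placed, which is why intermediate points along the continuum can beat both extremes.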

Approach

We have used MapReduce as a vehicle to explore this spectrum and develop techniques to determine task and data placement in order to reduce MapReduce job execution time. This work falls into three categories:

Publications

Cross-Phase Optimization in MapReduce. Benjamin Heintz, Abhishek Chandra, and Jon Weissman. Book chapter in Cloud Computing for Data-Intensive Applications. Springer.

End-to-End Optimization for Geo-Distributed MapReduce. Benjamin Heintz, Abhishek Chandra, Ramesh K. Sitaraman, and Jon Weissman. In IEEE Transactions on Cloud Computing.

Cross-phase Optimization in MapReduce. Benjamin Heintz, Chenyu Wang, Abhishek Chandra, and Jon Weissman. In the Proceedings of the IEEE International Conference on Cloud Engineering (IC2E). Redwood City, CA.

Improving MapReduce Performance in Highly Distributed Environments using End-to-End Optimization. Benjamin Heintz, Abhishek Chandra, and Ramesh K. Sitaraman. Poster at the USENIX Symposium on Operating Systems Design and Implementation (OSDI). Hollywood, CA.

Optimizing MapReduce for Highly Distributed Environments. Benjamin Heintz, Abhishek Chandra, and Ramesh K. Sitaraman. Technical Report 12-003. Department of Computer Science and Engineering, University of Minnesota.

Exploring MapReduce Efficiency with Highly-Distributed Data. Michael Cardosa, Chenyu Wang, Anshuman Nangia, Abhishek Chandra, and Jon Weissman. In the Proceedings of the Second International Workshop on MapReduce and its Applications (MAPREDUCE). San Jose, CA.