Geo-distributed MapReduce
Motivation
Data are increasingly generated and stored in a geographically distributed manner. Examples range from Web applications such as social media sites and the content delivery networks (CDNs) that serve their content, all the way to distributed sensors in mobile devices such as smartphones and embedded devices such as those for climate monitoring. Each of these uses generates a large volume of data in a geographically distributed manner.
To extract knowledge from these data, many applications need to analyze and process them efficiently, and often the resources available for this computation are also geographically distributed. As a result, we need to determine how to employ distributed resources to efficiently process geo-distributed data.
Challenges

At one extreme, distributed data inputs can be sent to a single central data center for processing, but this might be prohibitively slow or excessively costly, both in bandwidth and in the need to fit all data in one location.

At the other extreme, computation can be mapped onto the input data in situ, yielding intermediate results that typically require further aggregation in order to compute a final result. If this intermediate data volume is larger than the input data volume, then this aggregation might be more costly than moving the data to a central site in the first place.
Indeed, neither of these extremes is the right answer in all cases. Instead, the best approach typically lies somewhere along the continuum between the two extremes.
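The trade-off between the two extremes can be illustrated with a minimal sketch. It is not the model used in this work: it counts only the gigabytes that cross the wide-area network and ignores compute time and differences in bandwidth between sites. The function name wan_bytes, the site names, the input sizes, and the expansion factor (the ratio of intermediate to input data volume) are all hypothetical.

```python
def wan_bytes(input_gb, central, expansion, map_locally):
    """Gigabytes sent over the wide area to `central` under a given plan.

    input_gb:    dict mapping site name -> input data stored there (GB)
    central:     site where the final aggregation runs
    expansion:   assumed ratio of intermediate volume to input volume
    map_locally: set of sites that run the map phase in situ
    """
    total = 0.0
    for site, size in input_gb.items():
        if site == central:
            continue  # data already at the central site never crosses the WAN
        # A site that maps in situ ships intermediate data; otherwise raw input.
        total += size * expansion if site in map_locally else size
    return total


# Hypothetical deployment: three sites, aggregation at "us".
inputs = {"us": 100.0, "eu": 60.0, "asia": 40.0}
remote = {"eu", "asia"}

centralized = wan_bytes(inputs, "us", 0.2, set())      # ship all raw input
in_situ_shrinks = wan_bytes(inputs, "us", 0.2, remote)  # map shrinks the data
in_situ_expands = wan_bytes(inputs, "us", 1.5, remote)  # map expands the data
mixed_plan = wan_bytes(inputs, "us", 0.2, {"eu"})       # a point in between
```

Under these assumptions, in situ processing moves less data than the centralized plan when the expansion factor is below 1 and more when it is above 1. Passing only some sites in map_locally describes a plan between the two extremes; once compute time and uneven bandwidth are taken into account, such mixed plans can become the better choice.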
Approach
We have used MapReduce as a vehicle to explore this spectrum and develop techniques to determine task and data placement in order to reduce MapReduce job execution time. This work falls into three categories:
- Demonstrating the weaknesses of the popular Hadoop MapReduce implementation in geo-distributed settings
- Developing a model-driven optimization approach to compute an optimal placement of tasks within a geo-distributed MapReduce application
- Implementing scheduling techniques in Hadoop in order to make data and task placement better aware of their impact on the end-to-end execution time
Publications

Cross-Phase Optimization in MapReduce. Book chapter in Cloud Computing for Data-Intensive Applications. Springer.

End-to-End Optimization for Geo-Distributed MapReduce. In IEEE Transactions on Cloud Computing.

Cross-Phase Optimization in MapReduce. In the Proceedings of the IEEE International Conference on Cloud Engineering (IC2E). Redwood City, CA.

Improving MapReduce Performance in Highly Distributed Environments using End-to-End Optimization. Poster at the USENIX Symposium on Operating Systems Design and Implementation (OSDI). Hollywood, CA.

Optimizing MapReduce for Highly Distributed Environments. Technical Report 12-003. Department of Computer Science and Engineering, University of Minnesota.

Exploring MapReduce Efficiency with Highly-Distributed Data. In the Proceedings of the Second International Workshop on MapReduce and its Applications (MAPREDUCE). San Jose, CA.