Assignment 5

Due Wednesday, November 9, 2011, 5:15pm in class
(at the start of recitation)

Introduction

Please answer the questions precisely and concisely. You may need to do a bit of web surfing for many of these questions. Every question can be answered in one or at most a few sentences. I will not have the patience to read long paragraphs or essays, and you may lose credit even if a correct answer is buried somewhere in them.

Write neatly (or type). If I have to struggle to figure out what you wrote, you will lose credit. Type your answers if your penmanship is poor.

In the "I shouldn't have to tell you this" department... Should you feel the need to use multiple pages, please fasten the sheets securely. Use a stapler or other permanent fastener. Avoid paper clips since sheets can slide out.

Reading

MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat. Google, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
Read the PDF as the primary reference, but also take a look at the HTML slides.

Questions

  1. The user's reduce function is called many times. For each instance of a call, what data is it applied to?
  2. How does Google's MapReduce implementation address the problem of stragglers?
  3. (a) What is the purpose of the partitioning function?
     (b) What is the default partitioning function?
     (c) Why might you want a different one?
  4. What is the purpose of sorting the intermediate keys?