Map Reduce

MapReduce is a programming model used for processing large datasets in parallel. It involves two main steps:

  1. Map - The framework splits the input data into smaller chunks and distributes them to worker nodes. Each worker applies a map function to its chunk and emits intermediate key-value pairs.

  2. Reduce - The framework shuffles the intermediate pairs so that all values sharing the same key are grouped together. Each reducer then processes one group of values to produce the final output.

MapReduce allows scalable processing of huge datasets by distributing work across clusters in parallel. The best-known implementation is Hadoop; Spark generalizes the same idea with a broader set of operators.

Solution

Here is a simple example to find maximum temperature by station using MapReduce:

Map Step

station_id, temp
01, 45
01, 20
02, 30 
02, 25

The map step emits each record as a (station_id, temp) key-value pair:

01, 45
01, 20 
02, 30
02, 25

Reduce Step

Aggregate temperatures by station:

01, [45, 20] 
02, [30, 25]

Find max temp for each station:

01, 45
02, 30

MapReduce enables scalable parallel processing on huge datasets.
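The walkthrough above can be sketched in a few lines of Python. This is a local, single-process simulation of the map, shuffle, and reduce stages, not a distributed job:

```python
from collections import defaultdict

records = [("01", 45), ("01", 20), ("02", 30), ("02", 25)]

# Map step: emit (station_id, temp) key-value pairs.
pairs = [(station, temp) for station, temp in records]

# Shuffle: group all values by key.
grouped = defaultdict(list)
for station, temp in pairs:
    grouped[station].append(temp)

# Reduce step: take the maximum temperature per station.
max_temps = {station: max(temps) for station, temps in grouped.items()}
print(max_temps)  # {'01': 45, '02': 30}
```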

Description: Map Reduce

Map Reduce is a programming model for processing and generating large datasets that can be distributed and parallelized across a cluster of computers. The framework consists of a “Map” step, where a function is applied to each element in the input dataset to produce key-value pairs. These are then shuffled and sorted by key. Finally, the “Reduce” step processes each set of values that share the same key and combines them into a single output value.
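The shuffle-and-sort stage described above can be illustrated with Python's itertools.groupby. This is a local sketch; real frameworks perform the same grouping across the network between mappers and reducers:

```python
from itertools import groupby

# Intermediate key-value pairs as emitted by the mappers.
pairs = [("world", 1), ("hello", 1), ("hello", 1)]

# Shuffle and sort: order the pairs by key so equal keys become adjacent.
pairs.sort(key=lambda kv: kv[0])

# Reduce: each group of values sharing a key is combined into one output value.
reduced = {key: sum(value for _, value in group)
           for key, group in groupby(pairs, key=lambda kv: kv[0])}
print(reduced)  # {'hello': 2, 'world': 1}
```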

Solution:

Below are implementations of a simple word count problem using Map Reduce in Java, C++, and Python.

Java

In Java, you can use the Hadoop framework to implement Map Reduce. Here is a brief example of the word count problem.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: tokenize each input line and emit (word, 1) for every token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configure and submit the job.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

C++

C++ doesn’t have a standard Map Reduce library like Hadoop in Java. However, you can simulate the functionality manually.

#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

// Map: tokenize one chunk of text and emit a local word -> count table.
std::unordered_map<std::string, int> map_chunk(const std::string& text) {
    std::unordered_map<std::string, int> counts;
    std::istringstream stream(text);
    std::string word;
    while (stream >> word) {
        ++counts[word];
    }
    return counts;
}

// Reduce: merge a partial count table into the combined result.
void reduce(const std::unordered_map<std::string, int>& partial,
            std::unordered_map<std::string, int>& combined) {
    for (const auto& [word, count] : partial) {
        combined[word] += count;
    }
}

int main() {
    std::unordered_map<std::string, int> counts;
    reduce(map_chunk("Hello world hello"), counts);
    reduce(map_chunk("hello again"), counts);
    for (const auto& [word, count] : counts) {
        std::cout << word << ": " << count << '\n';
    }
}

Python

Hadoop Streaming lets you write the mapper and reducer as ordinary Python scripts. You can also easily simulate the pattern using standard Python libraries.

from collections import Counter

def mapper(text):
    # Map: count the words within one chunk of input.
    words = text.split()
    return Counter(words)

def reducer(count1, count2):
    # Reduce: merge two partial counts (Counter supports + natively).
    return count1 + count2

if __name__ == '__main__':
    text1 = "hello world"
    text2 = "hello"

    map1 = mapper(text1)
    map2 = mapper(text2)

    result = reducer(map1, map2)
    print(result)
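Under Hadoop Streaming, the mapper and reducer would instead exchange tab-separated "key\tvalue" lines over stdin/stdout, with the framework sorting the mapper output by key in between. The sketch below simulates that protocol locally with generator functions (the function names are illustrative, not a Hadoop API):

```python
from itertools import groupby

def streaming_mapper(lines):
    # Emit one "word\t1" record per token, as a Streaming mapper writes to stdout.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reducer(lines):
    # Streaming delivers mapper output sorted by key; group adjacent records and sum.
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # sorted() stands in for the framework's shuffle-and-sort phase.
    mapped = sorted(streaming_mapper(["hello world", "hello"]))
    for record in streaming_reducer(mapped):
        print(record)  # hello\t2, then world\t1
```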

Key Takeaways:

  • Map Reduce is a two-step process: Map and Reduce.
  • It is commonly used for data processing in distributed systems.
  • Libraries like Hadoop in Java and Hadoop Streaming in Python facilitate implementing Map Reduce.
  • In C++, you would typically implement the Map and Reduce functionalities manually.