Mechanisms used by Hadoop – MapReduce Computational Mechanism
Today we shall explore the Hadoop MapReduce computational mechanism, which, like everything else in Hadoop, is rooted in a small set of very simple concepts. As the name suggests, MapReduce sees a computational task as consisting of two phases: the mapper phase and the reduce phase, which are executed sequentially (first the mapper phase and then the reduce phase). During the mapper phase, all nodes perform the same computation, but each on a part of the dataset that is co-located on, or close to, that node. In other words, MapReduce uses the principle of data locality to increase performance and minimize data movement. It is important to note that because all the file blocks in the Hadoop distributed file system are of equal size, the mapper computation can be divided equally among the nodes. If the file blocks were not of equal size, the processing time would be dictated by the time it took to process the largest file block while the other nodes sat idle.
Consider the situation where you have a file that aggregates blog entries related to Big Data posted in the last 24 hours. The file has been stored in a 50-node cluster using the Hadoop distributed file system under the name big data.txt; the file system divided the file into file blocks and stored at least three copies of each block across the nodes. In this example, you want to start analyzing this big data by first counting the number of times that Hadoop, Big Data, and Greenplum have been mentioned. Assume you have a mapper function that takes as input the address of a file block and counts the number of times each of these words appears. Each node participating in the mapper phase then receives a pointer to the mapper function and the address of a file block co-located on the node. In this example, assume that three nodes are participating: node 1, node 2, and node 3.
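As a rough illustration, such a mapper could be sketched in Python as follows. This is a minimal sketch, assuming the block's text has already been read into a string; the function name and word list are illustrative, not Hadoop's actual API.

```python
# Hypothetical sketch of the mapper described above: given the text of one
# file block, it emits a key-value pair for each target word and its count.
# This is an illustration, not Hadoop's real mapper interface.
def mapper(block_text):
    targets = ["Hadoop", "Big Data", "Greenplum"]
    # str.count returns the number of non-overlapping occurrences
    return [(word, block_text.count(word)) for word in targets]

# A tiny stand-in for the contents of one file block:
block = "Hadoop pairs well with Big Data; Hadoop again, and Greenplum too."
print(mapper(block))  # [('Hadoop', 2), ('Big Data', 1), ('Greenplum', 1)]
```

In the real framework each node would run this function against the file block stored on its own disk, which is exactly the data-locality principle described above.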
Another simple principle behind MapReduce is that the output of the mapper phase and the input and output of the reduce phase consist of lists of key-value pairs. This is such an important concept in MapReduce that it is worth putting it in the context of the example we are working with. Here, a key would be the name Hadoop, Big Data, or Greenplum, and the value would be the number of times the name appears in the file. After executing the mapper function, each node produces a list of key-value pairs, where the key is one of the three names and the value indicates the number of times that name appeared in the text. For example, the output for node 1 could be Big Data-7, Greenplum-5, Hadoop-4; the output for node 2 could be Big Data-9, Greenplum-8, Hadoop-6; and the output for node 3 could be Big Data-3, Greenplum-4, Hadoop-9. In MapReduce, one node is selected to execute the reduce function; all other nodes need to send the key-value pairs in the lists created by their map function to the designated node. Assuming that node 2 in our example is the designated reduce node, node 1 and node 3 would send their results to node 2.
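In code form, the per-node outputs and the transfer to the designated reduce node might look like the sketch below (the values are chosen to be consistent with the example's final totals of 19, 17, and 19; the variable names are illustrative):

```python
# Key-value lists produced by the mapper on each node (values follow the
# worked example; node numbering is illustrative).
node1 = [("Big Data", 7), ("Greenplum", 5), ("Hadoop", 4)]
node2 = [("Big Data", 9), ("Greenplum", 8), ("Hadoop", 6)]
node3 = [("Big Data", 3), ("Greenplum", 4), ("Hadoop", 9)]

# Node 2 is the designated reduce node: node 1 and node 3 send their lists
# to it, so the reduce phase starts with all nine key-value pairs.
reduce_input = node1 + node2 + node3
print(len(reduce_input))  # 9
```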
Typically, during the reduce phase, the entries are sorted and shuffled before the reduce function runs. In our example the input to reduce would be Big Data-7, Greenplum-5, Hadoop-4 coming from node 1; Big Data-9, Greenplum-8, Hadoop-6 already residing on the local disk of node 2; and Big Data-3, Greenplum-4, Hadoop-9 coming from node 3. After sorting, where the key-value pairs with the same key have been grouped together, it would be Big Data-7, Big Data-9, Big Data-3, Greenplum-5, Greenplum-8, Greenplum-4, Hadoop-4, Hadoop-6, Hadoop-9. After shuffling, where the key-value pairs with the same key have been combined into one entry with a single key and a list of values, the result would be Big Data-3,7,9; Greenplum-4,5,8; Hadoop-4,6,9. The reduce function then adds up the values for each entry, leading to the final result: Big Data-19, Greenplum-17, Hadoop-19.

In the MapReduce framework, both the mapper and reduce operations are considered tasks, and the tasks together form a job. The framework relies on a job tracker to coordinate all the jobs run on the system by dividing each job into tasks and scheduling the tasks to run on nodes. The job tracker also keeps track of all the nodes participating in a computation, monitors their status, orchestrates the data flow, and handles task failures. A number of task trackers run the tasks and send progress reports to the job tracker. If a task fails, the job tracker can reschedule it on a different task tracker. Each task tracker keeps track of all the tasks running on its node, be they mapper or reduce tasks. Similar to the Hadoop distributed file system, where one node assumes the role of the name node, in MapReduce one node assumes the role of the job tracker.
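The sort, shuffle, and reduce steps walked through above can be sketched end to end. This is a plain-Python illustration of the data flow, not the MapReduce framework itself:

```python
from collections import defaultdict

# Combined mapper output arriving at the designated reduce node (node 2).
pairs = [
    ("Big Data", 7), ("Greenplum", 5), ("Hadoop", 4),   # from node 1
    ("Big Data", 9), ("Greenplum", 8), ("Hadoop", 6),   # local to node 2
    ("Big Data", 3), ("Greenplum", 4), ("Hadoop", 9),   # from node 3
]

# Sort: group key-value pairs with the same key together.
pairs.sort(key=lambda kv: kv[0])

# Shuffle: combine same-key pairs into one entry holding a list of values.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Reduce: add up the values for each key.
result = {key: sum(values) for key, values in grouped.items()}
print(result)  # {'Big Data': 19, 'Greenplum': 17, 'Hadoop': 19}
```

This reproduces the final result from the example: Big Data-19, Greenplum-17, Hadoop-19.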
Before running a job, a Hadoop programmer provides Hadoop MapReduce with the following information: the location of the data to be processed, which consists of a list of file blocks and the addresses of all their copies provided by the Hadoop distributed file system; a map function to be executed during the map phase; and a reduce function to be executed during the reduce phase. The programmer obtains, as a result, a list of key-value pairs. This is how the Hadoop MapReduce mechanism works.