Learn Hadoop Basics – How MapReduce works in a shared cluster?
In this article, we are going learn about Hadoop. Now we are going to start with something we all know, a simple web application; typically we are going to separate our presentation and our business logic layer and running them just on two nodes which talk with a database. If we want to make it scalable we can add as many copies as we want since everything is stateless. This is all running on a cluster which we are all familiar with. Additionally we can make our web app log and save the log elsewhere and do something with it.
So, now if we have many app and hence many little log files, what are we going to do with all of them? the log files will have a number of keys to users’ behavior; they could tell us what the users have viewed, what they clicked, how long they have stayed on the page, what they clicked on most often and much more. Thus by analyzing the log files we could make our web app better by optimizing the web app according to the user behavior or perhaps personalize the experience for each user.
Well, if one could get so much important information from these logs, why don’t people do it more often? Well, these weblogs are huge; they are gigabytes or even terabytes of data. So, this huge data is too huge to even store, so how can one analyze it? Well, the best solution for this type of “Big Data” is Hadoop; Hadoop, its symbol is the elephant, can go the heavy lifting when it comes to Big Data. So, Hadoop is open-source and it is free and by the way it’s written almost entirely in Java. It runs on commodity hardware which is cheap hardware; hence it eliminates the necessity of purchasing very costly Oracle rack or other expensive storage that is super reliable. Hence, we can store months, if not years, of weblogs on it quite cheaply.
Another advantage of Hadoop apart from cheap storage option is one can easily analyze the data using MapReduce. This paradigm was introduced by Google to its papers something like 10 years ago and a few years later an open source project started to implement it in Java, which is Hadoop. What it does is batch processing of very large files in the distributed file system. The paradigm is very simple and yet very powerful. When Google introduced this they said that they were running their anchor text, their algorithm, and their PageRank algorithm, for instance, in MapReduce. And hence, distributed file system and MapReduce are the two keys to their initial success of how they make web search better. Additionally, many other people found that they could do a lot with MapReduce.
So, what we could do with that is we can turn our web app into a data-driven app. So what does a MapReduce app look like? It runs in a cluster and it runs tasks there; there are two types of tasks – Mappers and Reducers. The Mappers read a part of the input data and each of the parts is called a split. Now each Mapper does some processing on the data and spits out key-value pairs. Now these key-value pairs are sorted, and this is called the shuffle phase, and then sent to a Reducer in such a way that all the processes for a single key are sent to the same reducer. In that way, the Reducer receives a bag of all the values for a single key and now it can aggregate, analyze or do whatever it needs to get the final output. This way one can build complex workflows of MapReduce jobs by taking the output of one Reducer and giving it as an input to another and most importantly we can run many of them at the same time in a shared cluster.