First let us understand the concept of Big Data: firstly we know that governments and businesses are all gathering data these days; the data is for a lot of different types including movies, images, transactions and much more. But why gather all this data and how is it useful? Well, the answer is: all this data is very valuable; analyzing all this data lets us do things like fraud going many years back. Additionally today storing data can be done quite cheaply and so we can afford to store all this data. But there is a catch; all this data won’t fit in a single disk or processor and so we have to distribute it across thousands of nodes.
But there is advantage in distributing data; if we can distribute, we can run in parallel and we can compute thousands of times faster and do things we couldn’t possibly do before and that is the concept behind Hadoop. So, this is how Hadoop works: suppose what I wanted to do was look for an image spread across hundreds of files. So first off Hadoop has to know where that data is. It goes and queries something called the name node to know about all the places where the data file is located. Once it has that figured out it sends your job to each one of those nodes. Each one of those processors independently reads the input file, looks for the image and writes out the result to a local output file. all this is done in parallel and when they all report finished your job is done.
Here we have seen one simple example of what we might do with Hadoop – image recognition. But there is a lot more to it than that. For example I could do the statistical analysis: I might want to calculate means, averages and correlations on all sorts of data. For example, I might want to look at unemployment vs. population vs. income vs. states. If I have all that data in Hadoop I could do that. I can also do machine learning and all sorts of other analysis. Once you got the data in Hadoop there is almost no limit to what you can do.
So, we’ve seen that in Hadoop data is always distributed, both the input and the output. There is more to it than that; the data is also replicated. Copies are kept in all the data blocks. So if one node fails it doesn’t affect the result. That’s how we get reliability. But sometimes we need to communicate between nodes; it is not enough that everybody processes their local data alone. An example is counting or sorting; in that case sorting is required and the trick for that in Hadoop is called MapReduce.
Let us look at an example of how MapReduce works. What we are going to do is take a little application called count dates, that counts the number of times a data occurred counted across many files. The first phase is called the Map Phase; each processor that has the input file reads the input file, counts the number of times the date has occurred in that file and writes it as a set of key-value pairs. After that is done we have what is called the Shuffle Phase. Hadoop automatically sends all of the 2000 data to one processor, 2001 data to another, 2002 data to another processor, etc. After that shuffle phase is complete we can do what is called the reduce phase. In the reduce phase all of the 2000 data is summed up and written to the output file. When everybody is done with their summations and the report is done then the job is done.