What is Hadoop and how does it work?
Hadoop is fundamentally infrastructure software for storing and processing large datasets. It is an open-source project of the Apache Software Foundation and it is enormously popular. To understand Hadoop you have to understand two fundamental things about it: one, how it stores files or data, and two, how it processes data.
Let us start with how Hadoop stores data. Hadoop runs on a cluster of machines. You may have heard of HDFS as part of Hadoop; that stands for Hadoop Distributed File System. Imagine a file that is larger than your PC's storage capacity, which means the file cannot be stored on it. Hadoop lets you store files that are bigger than what can be stored on one particular server. It also lets you store a large number of files; imagine that the same PC could only hold 50,000 files. What do you do if you have a million?
So, HDFS is a distributed file system, and it is distributed because it spreads data across multiple nodes or servers; you can have two nodes or ten nodes or you can have thousands of nodes. That is exactly the type of configuration that an internet giant like Yahoo has. So, one of the defining characteristics of Hadoop is that it can store many files and it can store very large files.
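To make the idea of spreading one large file over several servers concrete, here is a minimal sketch in Python. HDFS does store a file as a sequence of blocks held on different nodes, but everything else here is an assumption made for illustration: the function names are hypothetical, the 128 MB block size is just a commonly used value, the placement is simple round-robin, and replication is ignored. It is a toy model, not how HDFS actually decides where blocks go.

```python
# Toy model of distributed file storage: split a file into fixed-size blocks
# and spread the blocks over several nodes. This is NOT the real HDFS
# placement policy; names and numbers are illustrative assumptions.

def split_into_blocks(file_size, block_size):
    """Return the size in bytes of each block needed to hold the file."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks


def place_blocks(num_blocks, nodes):
    """Assign block indexes to nodes round-robin (a simplifying assumption)."""
    placement = {node: [] for node in nodes}
    for block_index in range(num_blocks):
        node = nodes[block_index % len(nodes)]
        placement[node].append(block_index)
    return placement


MB = 1024 * 1024

# A 1000 MB file is too big for any single (hypothetical) small node,
# but its blocks fit comfortably when spread over three of them.
blocks = split_into_blocks(file_size=1000 * MB, block_size=128 * MB)
placement = place_blocks(len(blocks), ["node-1", "node-2", "node-3"])

print("blocks needed:", len(blocks))
for node, block_ids in placement.items():
    print(node, "holds blocks", block_ids)
```

No single node has to hold the whole file, and adding more nodes adds more room, which is the point of the paragraph above.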
The second characteristic of Hadoop is its ability to process that data, or more precisely, it provides a framework for processing that data, which is called MapReduce. You have probably heard of MapReduce in association with Hadoop, and for a fundamental understanding of Hadoop it is important that you understand both HDFS and MapReduce. What is MapReduce? It is a framework that helps in processing all the data stored on all the nodes, but it does so in a distinctive way.
Think of the traditional architecture for processing data: the data is stored in one place and the code for processing it resides somewhere else. To make the processing happen, one has to move the data from the location where it is stored to the location where the code resides. That movement generally happens over a network, which can be very slow, and that is with ordinary datasets. Now imagine the scenario with a very large dataset; an enormous amount of time is spent just moving the data from the storage location to the location where it is to be processed.
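A quick back-of-the-envelope calculation shows how fast this cost grows. The sketch below assumes a 1 Gbit/s network link running at full speed with no overhead, and the dataset sizes are made up for illustration; real networks will usually do worse.

```python
# Rough estimate of how long it takes just to move a dataset over a network.
# Assumptions (not from any real cluster): an ideal 1 Gbit/s link, no
# protocol overhead, no contention with other traffic.

def transfer_seconds(dataset_bytes, link_bits_per_second):
    """Time to push dataset_bytes through a link of the given speed."""
    return dataset_bytes * 8 / link_bits_per_second


GB = 10**9
TB = 10**12
link = 10**9  # 1 Gbit/s, expressed in bits per second

for label, size in [("1 GB", GB), ("1 TB", TB), ("100 TB", 100 * TB)]:
    seconds = transfer_seconds(size, link)
    print(f"{label}: {seconds:,.0f} seconds ({seconds / 3600:.2f} hours)")
```

Under these assumptions one terabyte takes 8,000 seconds, a little over two hours, before any processing has even started, and the time scales linearly with the size of the data.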
So, instead of moving the data to the software, Hadoop takes the processing software and sends it to where the data is, so that the processing occurs on all the nodes where the data resides. The step that runs on each node against its local data is called Map; the step that gathers the partial results and combines them into the final answer is called Reduce. Only the much smaller results travel back over the network, not the raw data. That lets us process very large datasets very quickly, because instead of processing the data serially through one pipeline, the processing is distributed. So Hadoop can both hold very large datasets and process them, thanks to HDFS and MapReduce.
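The Map and Reduce steps can be sketched with the classic word-count example. What follows is a single-process simulation in plain Python, not Hadoop's actual API (real Hadoop jobs are typically written in Java and run across many machines). The node names, the sample text, and the function names are all hypothetical, and the grouping step in the middle, usually called the shuffle, is done here with an ordinary dictionary.

```python
from collections import defaultdict

# Hypothetical data: each node holds its own slice of a larger text file.
node_data = {
    "node-1": ["big data big", "cluster"],
    "node-2": ["data cluster data"],
    "node-3": ["big cluster"],
}


def map_phase(lines):
    """Runs where the data lives: emit a (word, 1) pair for every word."""
    pairs = []
    for line in lines:
        for word in line.split():
            pairs.append((word, 1))
    return pairs


def shuffle(all_pairs):
    """Group the emitted values by key so each word's counts sit together."""
    grouped = defaultdict(list)
    for key, value in all_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into one final answer per word."""
    return {key: sum(values) for key, values in grouped.items()}


# Each node maps over its own local lines; only the small pairs move on.
mapped = []
for node, lines in node_data.items():
    mapped.extend(map_phase(lines))

word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

In this simulation the loop over nodes runs one after another; on a real cluster the map work for each node would run at the same time, which is where the speed-up comes from.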