The Physical Storage of HDFS – Hadoop Cluster
Here we are going to look at the physical storage of HDFS because it’s important for us to have some understanding of how MapReduce jobs will run when we deploy them to an actual cluster.
Each of these squares represents a physical machine, and on each machine there would be a Java Virtual Machine, so in this case we would have five Java Virtual Machines, and they would run daemons, or services, to perform management. We have the HDFS management, which is the storage management, and, coming up, we shall talk about MapReduce, which is the job management. There will be one machine that runs the Name Node service, which maps out where the data lives on the various data nodes, and by default here we have shown three data nodes, which is about the simplest possible staging of a cluster environment, because you normally, as we mentioned, have three copies of the data available.
Now when you are developing, it's common that you will have the name node and the data node on the same machine, and you're working with what is called pseudo-distributed mode, where you have just one data node. In fact, you do not actually have to use HDFS to test out your MapReduce; you can run it against the local filesystem, and some developers do that as well. Now, the secondary name node depends on which version of Hadoop you're working with; a relatively new improvement to Hadoop is the ability to have high availability for the name node, and this was released late last year into most of the commercial distributions. Up until that point, if the name node were to fail, it would cause the whole Hadoop cluster to fail, so there wasn't high availability built in, and some of the commercial distributions are still based on the older binaries of Hadoop. That is an important thing to check: whether HA is available. High availability is something that you should discuss with your Hadoop administrator, but for a development environment you're normally just going to have all your daemons running on the same machine.
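As a sketch of what pseudo-distributed configuration typically looks like, here are the two standard Hadoop properties involved; the port and exact file layout vary by distribution, so treat the values as illustrative:

```xml
<!-- core-site.xml: point the filesystem at a single local HDFS instance -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: with only one data node, keep a single copy of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

Setting `dfs.replication` to 1 is the key difference from a real cluster, where the default of three copies gives you the redundancy described above.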
Speaking of developers, this is an image from Amazon representing their implementation of MapReduce, which is pretty standard at this level, and they've used colours here to help us developers understand how our MapReduce code will actually be executed on a Hadoop cluster. So, we can see we've got three major parts, and, to introduce the topic, as developers we will write the code that implements the map and the reduce; this is required, as nothing is generated for us, and we have to write it out ourselves.
The map code takes input values on each of our nodes (each of the squares in the diagram represents a physical data node), performs some type of process on the data on each node, and then produces some output for each node. Once the Mappers are complete, the shuffle and sort, which they call the secret sauce of MapReduce, takes over and moves the data to the appropriate reducer nodes. This diagram happens to show three Mappers and three Reducers, which is not representative of an actual MapReduce job; generally there are substantially more map tasks and quite substantially fewer reduce tasks. So it could be 1,000 map tasks and, say, 200 reduce tasks, and so on and so forth.
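The three phases described above can be sketched in plain Python; this is not Hadoop API code, just a minimal simulation of map, shuffle and sort, and reduce, using word counting as the process:

```python
from collections import defaultdict

# Map phase: each "node" turns its input split into (key, value) pairs.
def map_phase(split):
    return [(word, 1) for word in split.split()]

# Shuffle and sort: group all values for the same key together,
# so that each key lands on exactly one reducer.
def shuffle_and_sort(mapped_pairs):
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

# Reduce phase: aggregate the grouped values for each key.
def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped}

# Two "input splits", as if they lived on two different data nodes.
splits = ["big data big cluster", "big data"]
mapped = [pair for split in splits for pair in map_phase(split)]
result = reduce_phase(shuffle_and_sort(mapped))
# result == {'big': 3, 'cluster': 1, 'data': 2}
```

In a real cluster the map calls run in parallel on the nodes holding the data, and the shuffle moves intermediate pairs across the network, but the data flow is the same.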
Also, it's important to understand that a MapReduce job's process flow is designed to do one type of process against a large set of data. It's quite common in the real world that multiple MapReduce jobs are chained together one after another, somewhat like, for those of us coming from the relational world, the multiple tasks in an SSIS package.
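That chaining idea can be sketched in plain Python as well; again this is not the Hadoop API, just a toy runner (the job structure is made up for illustration) where the output of the first job becomes the input of the second:

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Toy MapReduce runner: map, shuffle/sort by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Job 1: count each word across the input lines.
counts = run_job(
    ["big data big", "big data"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda word, ones: (word, sum(ones)),
)
# counts == [('big', 3), ('data', 2)]

# Job 2: chained on job 1's output -- invert it to group words by frequency.
by_freq = run_job(
    counts,
    mapper=lambda pair: [(pair[1], pair[0])],
    reducer=lambda count, words: (count, sorted(words)),
)
# by_freq == [(2, ['data']), (3, ['big'])]
```

Each job does one kind of processing, and the intermediate result is handed to the next job, just as one SSIS task feeds the next in a package.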