Some common Hadoop Interview questions – Part 3
What is the difference between CDH3 and CDH4 with regard to the NameNode?
This is an important question. In CDH3 there is only one NameNode, so if that NameNode goes down the entire HDFS becomes inaccessible: the NameNode is a single point of failure. CDH4 supports NameNode High Availability with two NameNodes, one in active mode and one in passive (standby) mode. Only when the active NameNode fails is the passive one activated; it then takes over the failed NameNode's responsibilities, so the problem of inaccessibility is eliminated. In short, in CDH3 the NameNode is a single point of failure, while in CDH4 the standby NameNode provides high availability.
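As an illustration, an HA NameNode pair of the kind CDH4 (Hadoop 2) introduced is typically declared in hdfs-site.xml along these lines. This is only a sketch: the nameservice ID "mycluster", the NameNode IDs "nn1"/"nn2", and the hostnames are placeholder values, not details from the original text.

```xml
<!-- Illustrative hdfs-site.xml fragment for an HA NameNode pair.
     "mycluster", "nn1"/"nn2", and the hostnames are placeholders. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <!-- The two NameNodes that back the "mycluster" nameservice -->
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <!-- Let the failover controller promote the standby automatically -->
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```

With this in place, clients address the filesystem by the logical nameservice ID rather than a single NameNode host, which is what lets failover happen transparently.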
What is a heartbeat in HDFS?
HDFS and classic MapReduce run four daemons in a master–slave arrangement: the NameNode (master) with its DataNodes (slaves), and the JobTracker (master) with its TaskTrackers (slaves). Each slave contacts its master at regular intervals to report that it is alive: the DataNode sends the NameNode a heartbeat every 3 seconds by default, and the TaskTracker likewise sends the JobTracker a heartbeat every few seconds. If no heartbeat arrives within the stipulated timeout, the master daemon considers the slave dead and reassigns its work (blocks or tasks) to another DataNode or TaskTracker. In short, a "heartbeat" is a signal sent by a slave to its master daemon to say that it is alive and working.
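The DataNode side of this is driven by two settings in hdfs-site.xml. The sketch below shows the properties with their stock defaults from hdfs-default.xml; the values are defaults, not tuned recommendations.

```xml
<!-- Illustrative hdfs-site.xml fragment; values shown are the defaults. -->
<property>
  <!-- How often each DataNode heartbeats the NameNode, in seconds -->
  <name>dfs.heartbeat.interval</name>
  <value>3</value>
</property>
<property>
  <!-- How often the NameNode rechecks for stale DataNodes, in milliseconds -->
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>300000</value>
</property>
```

With these defaults the NameNode marks a DataNode dead only after roughly 2 × recheck-interval + 10 × heartbeat-interval, i.e. about 10 minutes 30 seconds of missed heartbeats, so a brief network hiccup does not trigger re-replication.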
Do we place 2 replicas in one rack and 1 replica in another rack?
This is a typical question and hence very important. Yes: with the default replication factor of 3 and rack awareness configured, HDFS's default block placement policy stores two replicas on nodes of one rack and the third on a node of a different rack. Keeping one replica off-rack is advantageous because even if the entire rack fails the data is not lost; it is still available from the other rack. Then why not spread the three replicas across three different racks? Because keeping two replicas on the same rack means that when one DataNode fails, the same data is present on another node of that rack, so the work can be shifted there without losing much time to cross-rack traffic. Hence placing 2 replicas in one rack and 1 in another is advantageous: it balances tolerance of rack failure against fast recovery from DataNode failure.
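Rack-aware placement only kicks in once Hadoop knows which rack each node belongs to. A minimal sketch of the relevant configuration is below; the script path is a hypothetical placeholder, and the script itself must be supplied by the administrator to map a host address to a rack name.

```xml
<!-- Illustrative fragment; the topology script path is a placeholder. -->
<property>
  <!-- Default number of replicas per block -->
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <!-- Admin-supplied script that maps a node's address to its rack,
       e.g. printing "/rack1" for a given hostname (Hadoop 2 name;
       Hadoop 1 / CDH3 uses topology.script.file.name) -->
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>
```

Without a topology script every node is treated as belonging to a single default rack, in which case all three replicas may land in the same rack and the rack-failure protection described above is lost.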