Some common Hadoop Interview questions – Part 1
Here we are going to see some of the most common Hadoop interview questions and their answers.

The first question: what is the responsibility of the name node in HDFS? Basically, the responsibility of the name node is to create and maintain metadata for the blocks we store on the different data nodes. When a client wants to write some data, it is the name node that knows which data nodes have free space to store that data. At short, regular intervals, every data node sends a block report and heartbeat to the name node, from which the name node keeps its metadata for each block up to date. If the name node stops receiving these regular updates from a particular node, it concludes that the node is dead. We can therefore say the name node is a single point of failure, because if the name node is down the entire cluster is inaccessible. In short: the name node is the master daemon that maintains metadata for blocks stored on data nodes; every data node regularly sends block reports to the name node; a data node that stops reporting is considered dead; and if the name node goes down, HDFS becomes inaccessible.
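The heartbeat-based liveness check described above can be sketched as a toy model. This is only an illustration of the idea, not Hadoop's actual internals; the class and method names are made up, and real HDFS derives its dead-node timeout from the `dfs.heartbeat.interval` and `dfs.namenode.heartbeat.recheck-interval` settings.

```python
import time

# Toy model of the name node's liveness tracking: each data node sends
# periodic heartbeats; if none arrive within a timeout, the node is
# marked dead. All names and the timeout value are illustrative only.
class ToyNameNode:
    def __init__(self, timeout_secs):
        self.timeout = timeout_secs
        self.last_heartbeat = {}  # data node id -> time of last report

    def receive_heartbeat(self, node_id, now=None):
        self.last_heartbeat[node_id] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        now = now if now is not None else time.time()
        return [n for n, t in self.last_heartbeat.items()
                if now - t > self.timeout]

nn = ToyNameNode(timeout_secs=30)
nn.receive_heartbeat("dn1", now=0)
nn.receive_heartbeat("dn2", now=0)
nn.receive_heartbeat("dn1", now=40)  # dn1 keeps reporting; dn2 goes silent
print(nn.dead_nodes(now=50))         # -> ['dn2']
```

The key design point mirrors HDFS: the name node never polls the data nodes; it passively collects reports and infers death from silence.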
How does the name node handle data node failures? Data node failures are quite common, since HDFS generally runs on commodity hardware; so even if a data node goes down, care must be taken that there is no loss of data. To make that happen, HDFS has a good amount of fault tolerance, coordinated by the name node. Once the name node detects that a data node has failed (through missed heartbeats), it removes the failed node from service and arranges for the blocks that node held to be re-replicated onto other data nodes, updating its metadata accordingly. The name node itself does not move the data; it only directs the process, and the actual copying happens directly between the surviving data nodes through a replication pipeline. Hence, the problem of data node failure is handled by removing the failed node and re-replicating its blocks from the surviving replicas onto other data nodes.
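The re-replication step above can be sketched as a small planning function. This is a hedged illustration of the logic, not Hadoop's API: the function and variable names are invented, and in real HDFS the name node's block placement policy is considerably more involved (rack awareness, load, etc.).

```python
# Toy sketch of re-replication after a data node failure: the coordinator
# (playing the name node's role) only updates metadata and picks source
# and target nodes; the block bytes would then be copied directly between
# data nodes. All names here are illustrative, not Hadoop's actual API.
def rereplicate(block_locations, live_nodes, target_replicas=3):
    """block_locations: dict of block -> set of nodes currently holding it.
    Returns a plan as a list of (block, source_node, destination_node)."""
    plan = []
    for block, holders in block_locations.items():
        holders &= live_nodes                      # drop replicas on dead nodes
        candidates = sorted(live_nodes - holders)  # nodes without this block
        while len(holders) < target_replicas and candidates:
            src = sorted(holders)[0]               # any surviving replica
            dst = candidates.pop(0)
            plan.append((block, src, dst))
            holders.add(dst)
        block_locations[block] = holders
    return plan

blocks = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn2", "dn3", "dn4"}}
live = {"dn1", "dn2", "dn3", "dn5"}  # dn4 has failed
print(rereplicate(blocks, live))     # -> [('blk_2', 'dn2', 'dn1')]
```

Note that the copy source is always a surviving replica, never the failed node itself, which matches the corrected description above.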
What is fault tolerance in HDFS? Since HDFS uses commodity hardware, the chance of failure of data nodes and task trackers is quite high, and the failure of a data node could lead to loss of data; to avoid this, HDFS comes with fault tolerance. By fault tolerance we mean there will be no data loss even when some of the devices that make up the HDFS system fail. So how is this possible? HDFS is designed with a replication factor of three by default: the same block is stored on three different nodes, so that even if one node fails the data isn't lost, and a job that needs it can run on either of the other nodes holding a replica. Thus, fault tolerance is the feature in HDFS whereby data is stored on three nodes by default, so that even if one of the nodes fails the data isn't lost and the required job can be run on either of the other two nodes.
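The default replication factor of three is controlled by the real `dfs.replication` property in `hdfs-site.xml`; the surrounding values here are just a minimal illustrative fragment, not a complete configuration:

```xml
<!-- hdfs-site.xml: default block replication factor.
     Three is the HDFS default; shown here only for illustration. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

The replication factor can also be changed for existing files with the `hdfs dfs -setrep` command, so individual files may carry more or fewer replicas than the cluster default.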