Some common Hadoop Interview questions – Part 2
What is speculative execution? In Hadoop MapReduce there is a job tracker and there are task trackers, where the job tracker is the master service and the task trackers are slave services. Just as a data node sends a periodic heartbeat to the name node, each task tracker sends a heartbeat to the job tracker every three seconds. If a task tracker fails to contact the job tracker for more than 3 seconds, the job tracker waits for another 10 heartbeat intervals. If the task tracker still has not made contact after those 10 intervals, the job tracker assumes that the task tracker is either working very slowly or dead. Once the job tracker reaches this decision, it passes the task on to the other two nodes that hold the same data block, and the output of whichever node finishes first is used. In other words, if a node appears to be running very slowly or to be dead, the master node can redundantly execute another instance of the same task, and the first output to arrive is taken. This process is called speculative execution.
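The "first output wins" idea can be illustrated with a small simulation. This is only a sketch and not Hadoop code: the node names, the delays, and the functions `run_attempt` and `speculative_execute` are made up for illustration, and threads stand in for task trackers.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_attempt(node, delay_seconds, block):
    """Simulate one task attempt: a node processes its replica of the block."""
    time.sleep(delay_seconds)  # stands in for how fast the node works
    return node, sum(block)    # the "task" here is just summing the block


def speculative_execute(block, node_delays):
    """Launch the same task on every node and keep the first result."""
    with ThreadPoolExecutor(max_workers=len(node_delays)) as pool:
        futures = [
            pool.submit(run_attempt, node, delay, block)
            for node, delay in node_delays.items()
        ]
        # as_completed yields futures in the order in which they finish.
        first_done = next(as_completed(futures))
        return first_done.result()


# Hypothetical nodes holding replicas of the same block; node-a is the slow one.
winner, output = speculative_execute([1, 2, 3], {"node-a": 0.5, "node-b": 0.01})
print(winner, output)
```

In this sketch the slower attempt is simply left to finish and its result is discarded; only the first result is returned to the caller.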
What is the default replication factor in HDFS? The default replication factor in HDFS is three; hence every data block is stored as three replicas. Replication is configured in hdfs-site.xml. Users who need a different value can increase or decrease the default replication factor by setting the dfs.replication property in the hdfs-site.xml file.
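The original snippet is not included here, so the following is a minimal sketch of what such an entry in hdfs-site.xml usually looks like, assuming the standard `dfs.replication` property. The value 3 is the default mentioned above; change it to the replication factor you want.

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```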
If the input file’s size is 200 MB, how many input splits will be given by HDFS and what is the size of each input split? By default the size of a block in HDFS is 64 MB (in Hadoop 1.x); hence 200 MB of data will be split into 4 blocks. The first three will be 64 MB each and the last one will be 8 MB. In the last block only 8 MB of space is used; the remainder is not wasted, because a block occupies only as much disk space as the data it holds, so the rest stays available for other files.
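The arithmetic above can be checked with a few lines of Python. This is just an illustration of the calculation, not an HDFS API; the function name `split_into_blocks` is made up for this example.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file of the given size occupies."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


blocks = split_into_blocks(200)
print(len(blocks), blocks)  # 4 [64, 64, 64, 8]
```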
Where do we have data locality in MapReduce? This is a very important aspect! First let’s understand the term data locality: if the data is available on the local machine that is going to process it, we say that the program has data locality. To understand this better, let’s look more closely at MapReduce. MapReduce is a framework, a technique for processing data stored in HDFS, and hence a distributed processing technique. Map is a process that works on data local to a data node. The job tracker assigns a task to a task tracker, which in turn applies it to the data local to a particular data node, and this process is called Map. Hence, we can say that Map has data locality. The Reducer, however, does not have data locality, for the following reason: once the Mappers on the various data nodes have done their job, the Reducer comes into the picture. The job of the Reducer is to pull data from the different Mappers, combine their outputs, and store the result (typically in HDFS). Since the Reducer collects outputs from Mappers located on different data nodes and doesn’t work on data local to its own node, we say it doesn’t have data locality. In other words, in MapReduce, the Map function has data locality but the Reduce function doesn’t.
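The difference between the two phases can be sketched with a toy word count. This is a plain Python illustration under made-up assumptions (the node names, the block contents, and the functions `map_task` and `reduce_task` are hypothetical), not the Hadoop API: each map task reads only the block stored on its own node, while the reduce task has to gather results from every node.

```python
from collections import Counter

# Hypothetical cluster: each data node holds one block of text locally.
blocks_by_node = {
    "node-1": "hadoop stores data",
    "node-2": "hadoop processes data",
}


def map_task(node):
    """Runs on `node` and reads only the block stored on that node."""
    return Counter(blocks_by_node[node].split())


def reduce_task(map_outputs):
    """Pulls the outputs of all mappers, wherever they ran, and merges them."""
    total = Counter()
    for counts in map_outputs.values():
        total += counts
    return dict(total)


# Map phase: one task per node, each working on local data.
map_outputs = {node: map_task(node) for node in blocks_by_node}
# Reduce phase: needs the output of every mapper, so its input is not local.
word_counts = reduce_task(map_outputs)
print(word_counts)
```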