HDFS, or the Hadoop Distributed File System, was inspired by the Google File System (GFS), and the Hadoop MapReduce framework was inspired by Google's MapReduce paper. Both MapReduce and HDFS run on a cluster of machines, and both follow a hierarchical, master-slave architecture. At a high level, what happens is that a large file is broken into smaller portions known as blocks, which are then replicated and distributed over a cluster of computers. Hadoop manages this distribution itself, so the user doesn't have to worry about how the file is divided or where its pieces go. Like an operating system, Hadoop manages the file system; internally there is one master node known as the name node.
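The splitting and replication described above can be sketched in a few lines. This is a toy, single-process simulation, not how HDFS is implemented; the block size, replication factor, node names, and round-robin placement are all illustrative assumptions (real HDFS defaults to 128 MB blocks and a replication factor of 3, and uses rack-aware placement).

```python
# Toy sketch: split a "file" into fixed-size blocks and replicate each
# block across several data nodes. All names and sizes are illustrative.

BLOCK_SIZE = 16          # bytes, tiny for illustration (HDFS default: 128 MB)
REPLICATION_FACTOR = 3   # HDFS default is also 3
DATA_NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Break a file's contents into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int):
    """Assign each block to REPLICATION_FACTOR distinct data nodes
    (simple round-robin; real HDFS placement is rack-aware)."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            DATA_NODES[(block_id + r) % len(DATA_NODES)]
            for r in range(REPLICATION_FACTOR)
        ]
    return placement

file_data = b"a large file that will be split into small blocks"
blocks = split_into_blocks(file_data)
placement = place_replicas(len(blocks))
print(len(blocks))   # number of blocks the file was split into
print(placement)     # block id -> data nodes holding a copy
```

The point of the sketch is only the shape of the idea: the file exists nowhere as a whole; it lives as replicated blocks scattered over the cluster, and some component must remember where they went.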
The name node makes sure that the data is distributed among the data nodes and keeps track of where each block lives. So, basically, the name node manages the file system and the data nodes actually store the data blocks. Both the name node and the data nodes are Hadoop daemons, which are Java programs that run on specific machines. Hence, they aren't hardware components; rather, the machine that runs the name node daemon typically needs to be more powerful than the machines that run the data node daemons. Because of this difference in specifications and configuration, the physical machines themselves are often referred to as the name node and the data nodes, but strictly speaking those names refer to the Java programs.
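The division of labour between the name node (metadata only) and the data nodes (actual bytes) can be made concrete with a small sketch. These classes and the read path are illustrative assumptions, not Hadoop's real internal API:

```python
# Toy sketch of the name node / data node split: the name node holds only
# metadata (which blocks make up a file, and where each block's replicas
# live), while the data nodes hold the actual bytes.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes (the actual data)

class NameNode:
    def __init__(self):
        self.file_to_blocks = {}  # filename -> ordered list of block ids
        self.block_locations = {} # block_id -> DataNodes holding a replica

    def locate(self, filename):
        """Answer a client's question: where do this file's blocks live?"""
        return [(bid, [dn.name for dn in self.block_locations[bid]])
                for bid in self.file_to_blocks[filename]]

# Wire up a tiny two-node "cluster" by hand.
nn = NameNode()
d1, d2 = DataNode("node1"), DataNode("node2")
d1.blocks[0] = b"first block "
d2.blocks[0] = b"first block "          # replica of block 0
d2.blocks[1] = b"second block"
nn.file_to_blocks["report.txt"] = [0, 1]
nn.block_locations = {0: [d1, d2], 1: [d2]}

# A read: ask the name node for locations, then fetch from data nodes.
data = b"".join(nn.block_locations[bid][0].blocks[bid]
                for bid in nn.file_to_blocks["report.txt"])
print(data)
```

Note that the file's bytes never pass through the name node: a client consults it for block locations, then reads directly from the data nodes, which is also how the real read path works.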
In the MapReduce framework, the problem is divided into two phases: the Map phase and the Reduce phase. In the Map phase, the Mapper code is distributed to the machines, and each machine works on the data that is locally present on it; this concept is termed data locality. The results produced by these local computations are aggregated and transferred to the Reducer; the Reduce logic is then applied to this global data to produce the final result. Programmers need to write only the Map logic and the Reduce logic; distributing the Map code to the correct machines is handled entirely by Hadoop. Again, this is a very superficial overview of how a job is done.
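The two phases can be sketched with word count, the classic MapReduce example. This is a minimal, single-process simulation of the model: Hadoop would run the map function on many machines against their local chunks and perform the shuffle itself, whereas here everything runs in one process just to show the shape of the two phases.

```python
# Single-process sketch of the two-phase MapReduce model (word count).

from collections import defaultdict

def map_phase(chunk):
    """Mapper: emit (word, 1) for every word in a local chunk of data."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(grouped):
    """Reducer: sum the counts emitted for each word across all mappers."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Two "machines", each holding one chunk of the input locally.
chunks = ["the quick brown fox", "the lazy dog and the fox"]

# Map phase: each chunk is processed where it lives (data locality).
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]

# Shuffle: group intermediate pairs by key before they reach the reducer.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

result = reduce_phase(grouped)
print(result["the"])  # 3
print(result["fox"])  # 2
```

Notice that the programmer writes only `map_phase` and `reduce_phase`; the splitting of input, the shuffle, and the movement of code and data are exactly the parts Hadoop takes over.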
There are various complexities in designing the solution, since it now has to be divided into two phases. The basic idea is that the job is spread across a cluster of computers, and in order to process it in a distributed fashion the algorithm must be broken into the Map phase and the Reduce phase. This is very different from the earlier approach: when computation was done on a single machine, a single algorithm processed all the data. With distributed frameworks like Hadoop, the solution needs two components: the Map phase, where the data is processed locally, and the Reduce phase, where the data is processed on the global aggregate of the Map phase's output.