+91 70951 67689 datalabs.training@gmail.com

 HDFS is…


To understand how it is possible to scale a Hadoop cluster to hundreds or even thousands of nodes, we have to start by understanding HDFS or the Hadoop Distributed File System. Data in Hadoop cluster is broken down into smaller pieces called blocks that are distributed throughout the cluster. In this way, the Map and Reduce functions can be executed in parallel on smaller subsets of our data. This provides the scalability that is needed for Big Data processing.


The goal of Hadoop is to use commonly available servers in a very large cluster where each server has its own set of internal disk drives. For higher performance, MapReduce tries to assign workloads to that server where the data to be processed is stored. This is known as data locality. Think about a 1000 machine cluster where each machine has 3 disk drives; then consider a failure rate of a cluster composed of 3000 drives and a thousand servers. The component meantime to failure we are going to experience in Hadoop is very high.


The cool thing about Hadoop is the reality of the failure rates is well understood. It is actually a design point; and part of the strength of Hadoop it has built-in fault tolerance and fault compensation capabilities. HDFS tolerates disk failures by storing multiple copies of each data block on different servers in a Hadoop cluster. An individual file is actually stored as smaller blocks that are replicated across multiple servers across the cluster.


To maintain availability when disks fail HDFS replicates these smaller pieces onto additional servers. The default is to store two additional copies, but that can be increased or decreased on a per file basis or for the whole environment. This design offers multiple benefits, the most obvious being higher availability. It also allows Hadoop to break a work into multiple jobs; these jobs run in parallel on different servers, each processing a block of data. This gives excellent scalability. Finally, you get the benefit of data locality which is critical in working with large datasets.