+91 70951 67689 datalabs.training@gmail.com

File storage in Hadoop file system


Here we are going to learn how data is generally stored in the Hadoop file system. When a single file is to be stored on the Hadoop Distributed File System (HDFS), it is first broken up into blocks of a fixed size. The block size is configurable, but every block of a file uses that same size, except for the last block, which may be smaller if the file does not divide evenly.
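The splitting step above can be sketched in a few lines of Python. This is only an illustration of the idea, not HDFS code; the tiny block size is an assumption chosen so the example is easy to read.

```python
# Illustrative sketch (not HDFS code): split a file's bytes into
# fixed-size blocks. Real HDFS block sizes are tens of megabytes.
BLOCK_SIZE = 4  # bytes; deliberately tiny for illustration

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Return consecutive blocks; only the final block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"hello world!!!")
print(blocks)       # [b'hell', b'o wo', b'rld!', b'!!']
print(len(blocks))  # 4
```

Note that the 14-byte input produces three full 4-byte blocks plus one short final block, matching the rule that only the last block of a file may be smaller.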


To take advantage of Hadoop's features, you would generally run Hadoop on a cluster of computers. When you store a file, it is broken up into blocks of a fixed size, and each block is written to a computer in the cluster. The blocks do not all go onto the same computer; generally they are spread across the cluster. In this context, you should know that each computer in the cluster that stores blocks is called a data node. The target machine for each block is selected by the Hadoop system's block placement policy.


So, it has to be understood that even when we store a single file with Hadoop, multiple computers play a role. This is very advantageous because it allows large files to be stored on the Hadoop file system. In fact, a file larger than the capacity of any single node's hard disk can be stored, because the file is broken into blocks and the blocks are spread across the cluster.
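A quick back-of-the-envelope calculation makes this concrete. The file, disk, and cluster sizes below are assumed numbers chosen only to illustrate the point.

```python
# Assumed numbers: a ~1 TB file stored on a 4-node cluster whose
# individual disks hold only ~500 GB each.
BLOCK_SIZE_MB = 128       # assumed block size
FILE_SIZE_MB = 1_000_000  # ~1 TB file
DISK_SIZE_MB = 500_000    # ~500 GB per data-node disk
NUM_NODES = 4

num_blocks = -(-FILE_SIZE_MB // BLOCK_SIZE_MB)          # ceiling division
per_node_mb = (num_blocks * BLOCK_SIZE_MB) / NUM_NODES  # if spread evenly

print(num_blocks)                   # 7813
print(per_node_mb <= DISK_SIZE_MB)  # True
```

No single 500 GB disk could hold the 1 TB file, but each node's share of the blocks (about 250 GB here) fits comfortably.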


Generally, the block size defaults to 64 MB (128 MB in Hadoop 2 and later), but it can be changed as per the requirements of the user. Comparing the block size in the Hadoop distributed file system to the block size in conventional file systems, which is typically a few kilobytes, it can be noticed that in Hadoop the block size is many times larger. This larger block size is advantageous in the sense that it results in faster streaming of data, since each block can be read in one large sequential pass. You should also know that Hadoop makes use of something called replication, storing multiple copies of each block on different data nodes, which helps it combat hardware failure.
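Both settings mentioned above live in the hdfs-site.xml configuration file, via the `dfs.blocksize` and `dfs.replication` properties. The values below are only an example (a 128 MB block size and the common replication factor of 3), not required settings.

```xml
<!-- hdfs-site.xml: example values, adjust to your requirements -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>  <!-- 128 MB, expressed in bytes -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>          <!-- copies kept of each block -->
  </property>
</configuration>
```

With a replication factor of 3, losing one data node leaves two intact copies of every block it held, which the system can use to restore the desired replica count.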


This is a simple description of how data is stored in the Hadoop distributed file system and how very large files can be accommodated by this particular file system.