Hadoop file system – Limitations
Here we are going to learn about the limitations of the Hadoop file system. The Network File System (NFS) is the oldest and most commonly used distributed file system, and it was designed for a general class of applications. Hadoop, by contrast, is useful only to specific kinds of applications. Hadoop was created to address the limitations of such distributed file systems: it can store very large amounts of data, offers protection against failure, and provides fast access. But it should be understood that these benefits come at a cost.
So, Hadoop does have some limitations, which is why it is suitable only for specific kinds of applications. Here we are going to discuss these limitations, covering random reads, updates, and loss of performance. Hadoop is designed for applications that perform streaming reads, not random reads: if a file has four parts, the application reads the parts one by one, from 1 to 4, until the end. A random seek, where you jump to a specific location in the file, is something Hadoop does not support efficiently. Hence, Hadoop is designed for non-real-time batch processing of data.
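The difference between the two access patterns can be sketched with an ordinary local file (an analogy only, not HDFS itself; HDFS is built around the first pattern and serves the second poorly):

```python
import os
import tempfile

# Write a small file with four 10-byte "parts".
tmp = tempfile.NamedTemporaryFile(delete=False)
for part in (b"AAAAAAAAAA", b"BBBBBBBBBB", b"CCCCCCCCCC", b"DDDDDDDDDD"):
    tmp.write(part)
tmp.close()

# Streaming (sequential) read: parts 1 through 4, front to back.
with open(tmp.name, "rb") as f:
    sequential = [f.read(10) for _ in range(4)]

# Random seek: jump straight to part 3 (byte offset 20). Cheap on a
# local file, but not an access pattern HDFS is designed to serve.
with open(tmp.name, "rb") as f:
    f.seek(20)
    part3 = f.read(10)

os.remove(tmp.name)
```

The sequential loop mirrors how a MapReduce task consumes a block from beginning to end; the `seek` call mirrors the random access that low-latency applications need and HDFS deliberately does not optimize for.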
Since Hadoop is designed for streaming reads, caching of data is not provided. Caching would mean that when you read the same data a second time, it can be served very quickly from the cache. Hadoop omits this because a sequential read already gives you fast access to the data directly; hence caching isn't available in Hadoop.
Another characteristic the application should have is that it writes the data once and then reads it several times. It should not update the data it has written: updating data in closed files is not supported. Note, however, that from release 0.19 onward, appending is supported for files that have not yet been closed. For files that have been closed, updating isn't possible.
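This write-once-read-many contract can be sketched as a toy Python class (an illustration of the semantics only, not HDFS's actual API; the class name and methods are invented for this sketch):

```python
class WriteOnceFile:
    """Toy model of HDFS file semantics: appendable while open,
    immutable once closed, readable at any time."""

    def __init__(self):
        self._data = bytearray()
        self._closed = False

    def append(self, chunk: bytes):
        if self._closed:
            raise IOError("cannot modify a closed file")
        self._data.extend(chunk)

    def close(self):
        self._closed = True

    def read(self) -> bytes:
        # Reads are always allowed, whether the file is open or closed.
        return bytes(self._data)


f = WriteOnceFile()
f.append(b"first,")
f.append(b"second")        # appending to a still-open file is fine
f.close()
contents = f.read()        # reading a closed file is fine

rejected = False
try:
    f.append(b"update")    # in-place update of a closed file is rejected
except IOError:
    rejected = True
```

The one-way `closed` flag is the essence of the model: once a file is sealed, the only supported operation is reading it, which is exactly the constraint that lets HDFS keep replicas simple and consistent.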
With Hadoop we aren't talking about one computer; we usually have a large number of computers, and hardware failures are unavoidable. Sometimes one computer will fail, and sometimes an entire rack can fail. Hadoop gives excellent protection against hardware failure, but performance goes down in proportion to the number of computers that are down. In the big picture this is rarely noticeable: if 3 out of 100 computers fail, 97 are still working, so the proportionate loss of performance is small. Still, the way Hadoop works, there is a loss of performance, and this loss through hardware failures is managed by the replication strategy.
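Replication is configured per cluster; `dfs.replication` is the real HDFS property that controls how many copies of each block are kept (three by default), so a failed node or rack costs some throughput rather than data. The snippet below is an illustrative fragment of `hdfs-site.xml`, not a complete configuration file:

```xml
<!-- hdfs-site.xml fragment (illustrative) -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Number of replicas kept for each block.</description>
</property>
```

With three replicas, losing a machine leaves two live copies of each of its blocks, and the NameNode re-replicates them in the background to restore the configured count.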
Hence, these are some of the most commonly encountered limitations of Hadoop. In this article, the term generally refers to the Hadoop file system.