Storing and computing data before the creation of Hadoop
Today we live in a world of data; wherever we look, there is data. The important questions are how to store that data and how to process it. First, we have to understand the concept of Big Data: data that is beyond our storage capacity and beyond our processing power. Where does so much data come from? There are many data generators: sensors, CCTV cameras, social networks like Facebook, online shopping websites, airlines, NCDC weather records, hospitality data, and so on. In this way, we end up with huge amounts of data.
If you consider all the data in the world today, about 90% of it was generated in the last two years. So, why was this much data generated in just the last two years? The reason is that we have been using a growing number of such resources; people use them and create data, and all of it needs to be stored, as nothing can be discarded. Talking about storage, in the 1990s hard disk capacity was 1 GB to 20 GB, RAM capacity was 64 to 128 MB, and read speed was roughly 10 Kbps. By 2014, hard disk capacity had grown to 1 TB, RAM capacity to 4 to 16 GB, and read speed to 100 Mbps.
So, in a span of roughly 20 years, hard disk capacity increased about 1,000 times, and there were similarly large increases in RAM and read speed. Why was such a huge increase in these capacities needed? The reason is the huge amount of data being generated. Since the volume of data keeps growing, there must be adequate resources to store it. But there may come a time when a company can no longer handle the storage of the data it generates. For this, the company has to approach a data center to store the data. These data centers maintain servers, such as IBM or EMC servers, and these servers are called sandboxes.
So, when we have to process the data, it must first be fetched to the local server, and only then can it be processed. Say we have 100 TB of data in the data center and want to process some of it, say 2 TB. To process the data, we need to write some code, which might be in Java, Python, SQL, or anything else. Say the code we have written totals 100 KB. Hence, to process 2 TB of data we are writing 100 KB of code. So, which is better for the computation: sending the 100 KB of code to the data sitting in the data center, or bringing the 2 TB of data to your local server?
Sending the 100 KB of code to the data in the data center is definitely better, because it is far smaller in size. Hence, the code can be transferred in much less time than it would take to transfer the 2 TB of data to your local machine. But even though sending the code to the data would be easy, we could not do it. Why? Before Hadoop, computation was processor-bound: wherever you had written the code, you had to fetch the data to that processor and compute it there. That was the only technique available before the creation of Hadoop. Hence, we could not send the program to the data; rather, we had to fetch the data to the machine where the program was running.
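The size difference above can be made concrete with a rough back-of-the-envelope calculation. The 100 MB/s link speed below is an assumption chosen for illustration (in the same ballpark as the 2014 read speed mentioned earlier), not a figure from the text:

```python
# Back-of-the-envelope comparison: ship the code to the data,
# or ship the data to the code?
# Assumption (illustrative only): a sustained 100 MB/s transfer link.

LINK_BYTES_PER_SEC = 100 * 10**6   # assumed link speed: 100 MB/s

CODE_SIZE = 100 * 10**3            # 100 KB of code
DATA_SIZE = 2 * 10**12             # 2 TB of data to be processed

code_seconds = CODE_SIZE / LINK_BYTES_PER_SEC
data_seconds = DATA_SIZE / LINK_BYTES_PER_SEC

print(f"Shipping the code: {code_seconds:.3f} s")       # 0.001 s
print(f"Shipping the data: {data_seconds / 3600:.1f} h") # about 5.6 h
print(f"Data transfer is {DATA_SIZE // CODE_SIZE:,}x larger")
```

Under this assumption, moving the code takes a millisecond while moving the data takes hours, which is why the code-to-data model that Hadoop later introduced is so attractive.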