Defining Big Data and the role of Hadoop in managing Big Data
We have seen that huge amounts of data are stored in a data center, which uses many servers to hold them. But before Hadoop, people ran into problems when trying to process such massive amounts of data, because computation was processor-bound: the program that had to run on the data could not be sent to where the data lived; instead, the data itself had to be transported to the processor on which the program was running.
But this process of sending data to the processor is not easy; it runs into three problems. First, there is the issue of volume: the data being stored is growing rapidly and is huge in size, running into gigabytes, terabytes, petabytes or more. Second, there is the problem of velocity: fetching that much data from the data center and sending it to the processor is slow. Third, there is the problem of variety. Today we use an RDBMS to store data, but an RDBMS only knows how to store structured data and process it through queries. Of the data stored globally today, however, roughly 70-80% is unstructured or semi-structured. For example, the data generated on Facebook consists of videos, images, text messages, audio, and so on, all of which is unstructured; a simple example of semi-structured data is a log file. Whatever the shape of the data, whether structured, unstructured, or semi-structured, we have to be able to store and process it.
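To make the log-file example concrete, here is a small Python sketch. The log line and field layout below are made up for illustration: semi-structured data like this has a recognizable pattern but no fixed schema, so instead of a SQL query we pull the fields out with a regular expression.

```python
import re

# A hypothetical web-server log line (semi-structured: it follows a
# pattern, but there is no fixed table schema as in an RDBMS)
log_line = '192.168.1.5 - - [10/Oct/2023:13:55:36] "GET /index.html" 200'

# Extract the fields with a regex instead of a structured query
pattern = r'(\S+) - - \[(.*?)\] "(\S+) (\S+)" (\d+)'
match = re.match(pattern, log_line)
ip, timestamp, method, path, status = match.groups()

print(ip, method, status)  # the parts we recovered from the raw text
```

The point is that the structure has to be imposed at read time by the program; nothing in the storage layer enforces it, which is exactly why a traditional RDBMS struggles with this kind of data.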
Hence, to define Big Data: it is any data characterized by Volume, Variety, and Velocity; it may exhibit all three together, just one of them, or any combination of two. This definition of Big Data comes from IBM. Now let us understand Big Data in more depth. Say we have a system with a 500 GB hard disk. On this system, the data we store cannot exceed 500 GB, and since we cannot store data beyond 500 GB, we cannot process it either. Hence, for a system with a storage capacity of 500 GB, anything exceeding that limit is Big Data for that system.
Now let us consider another scenario: we have a system with a 500 GB hard disk on which we want to store 300 GB of data. Since this does not exceed the maximum capacity of the drive, the data can be stored and processed. Processing it will still take some amount of time, so what if we want to reduce that processing time? This is where Hadoop comes in. To reduce the time required to process a vast amount of data, Hadoop splits the job into parts and runs those parts in parallel. Hence, the time taken for processing this huge amount of data is reduced considerably.
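Hadoop realizes this split-and-combine idea through MapReduce. A minimal sketch of the pattern in plain Python follows; the word-count job and the chunking are illustrative only, and in a real cluster each chunk would be shipped to a different node and the map calls would actually run in parallel.

```python
from collections import Counter

def map_chunk(lines):
    """Map step: count the words within one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

lines = ["big data big hadoop", "hadoop splits the job", "big job"]

# Split the job into parts (Hadoop would distribute these across nodes)
chunks = [lines[:2], lines[2:]]

# Each map call is independent, so the chunks could run at the same time
partials = [map_chunk(chunk) for chunk in chunks]
result = reduce_counts(partials)

print(result["big"])  # total count assembled from the partial results
```

Because no chunk depends on another, adding machines lets more chunks be processed simultaneously, which is precisely how Hadoop cuts the overall processing time.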