
Big Data and Hadoop’s role in analyzing Big Data – Part 2

We have discussed Big Data, and we understand that it means very large files. But what exactly is the problem, and where does Hadoop fit in? Here is the problem: a traditional hard disk can read about 60-100 megabytes per second. Solid-state disks raise that to about 250-500 megabytes per second. That is an improvement, but not as big a breakthrough as we have seen in other areas. Compared with the progress made elsewhere, disk speed has remained relatively flat, while the size of data has been growing exponentially: every fifteen months the data we have collected doubles in size. That is because so much data is being produced by systems, users, applications and so on, as we have talked about.

Here is a little perspective for you: in the 1990s the capacity of a hard disk was about 2.1 GB, in the 2000s it was about 200 GB, and recently it is close to 3 TB; you can easily go and buy a 4 TB hard disk on the market. The price has been falling as well: 1 GB now costs only about 5 cents, where it used to cost about $160. Speed has grown too, no doubt, from 16 MB per second to about 210 MB per second. The interesting part is the time it takes to read the whole hard disk. A 2.1 GB disk took about 126 seconds to read in full; the speed was slow, but the disk was small. Now the disk size has increased and the speed has increased as well, but not to the same extent, so it takes about four hours to read the whole hard disk.
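To make the comparison concrete, here is a small sketch of the arithmetic. It assumes decimal units (1 GB = 1000 MB) and a recent disk of roughly 3 TB, which is the capacity implied by a four-hour read at 210 MB per second; the function name is just an illustration.

```python
# Full-disk read time = capacity / sequential read speed.
# Assumption: decimal units, 1 GB = 1000 MB.

def full_read_seconds(capacity_gb: float, speed_mb_per_s: float) -> float:
    """Seconds needed to read an entire disk from start to end."""
    return capacity_gb * 1000 / speed_mb_per_s

old_disk = full_read_seconds(2.1, 16)      # 1990s disk: 2.1 GB at 16 MB/s
new_disk = full_read_seconds(3000, 210)    # assumed ~3 TB disk at 210 MB/s

print(f"1990s disk:  {old_disk:.0f} seconds")
print(f"recent disk: {new_disk / 3600:.1f} hours")
print(f"capacity grew about {3000 / 2.1:.0f}x, speed only about {210 / 16:.0f}x")
```

With these units the old disk comes out at roughly two minutes, in line with the figure quoted above, and the recent disk at about four hours. Capacity has grown by three orders of magnitude while speed has grown by only one, and that gap is the whole problem.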

In the Big Data world a 1 TB file is a very small file, and there could be many such files. A traditional hard disk will take about 10,000 seconds, or close to 3 hours, to read this file. A solid-state disk, which is very expensive, will take about 2,000 seconds, or 33 minutes. So if it takes you half an hour to read one file and you have tons of these files, then obviously that is not going to cut it. In the traditional approach, an enterprise gets a very powerful computer and feeds it whatever data is available to crunch the numbers. This computer will do a good job, but only up to a certain point; a point will come when it can no longer do the processing, because it is not scalable and Big Data keeps growing. So the traditional enterprise approach does have its limitations when it comes to Big Data.
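The same kind of estimate shows how quickly a single machine falls behind once there are many files. This is only a rough sketch: it takes the upper end of each speed range quoted earlier (100 MB/s for a hard disk, 500 MB/s for a solid-state disk), and the count of 100 files is a made-up number for illustration.

```python
# Time for one machine to read 1 TB files one after another.
# Assumption: decimal units, 1 TB = 1,000,000 MB.
TB_IN_MB = 1_000_000

def read_seconds(size_mb: float, speed_mb_per_s: float) -> float:
    """Seconds to read size_mb megabytes at a steady speed."""
    return size_mb / speed_mb_per_s

hdd_one_file = read_seconds(TB_IN_MB, 100)   # traditional hard disk
ssd_one_file = read_seconds(TB_IN_MB, 500)   # solid-state disk

file_count = 100  # hypothetical number of 1 TB files
ssd_all_files_days = file_count * ssd_one_file / 86_400  # 86,400 s per day

print(f"HDD, one file: {hdd_one_file:.0f} s ({hdd_one_file / 3600:.1f} h)")
print(f"SSD, one file: {ssd_one_file:.0f} s ({ssd_one_file / 60:.0f} min)")
print(f"SSD, {file_count} files: {ssd_all_files_days:.1f} days")
```

Even on the faster disk, a hundred such files would keep one machine busy reading for more than two days before it has done any actual number crunching.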

Hadoop takes a very different approach from the enterprise approach: it breaks the data into smaller pieces, and that is why it is able to deal with Big Data. Breaking the data into smaller pieces is a good idea, but then how are you going to perform the computation? Hadoop breaks the computation down into smaller pieces as well, and it sends each piece of computation to a piece of data. The data is broken into equal pieces so that these child computations finish in roughly equal amounts of time. Once all the computations are finished, the results are combined and sent back to the application as one overall result. These, then, are the challenges produced by Big Data, and this is how Hadoop addresses them.
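The idea of splitting the data, sending a small computation to each piece and then combining the results can be sketched in a few lines of plain Python. This is not Hadoop's API; it is a toy illustration in which a thread pool stands in for the machines of a cluster, and all the function names are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(data, chunk_size):
    """Break the data into equal-sized pieces (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def partial_sum(chunk):
    """The small 'child' computation that runs on one piece of data."""
    return sum(chunk)

def combine(partials):
    """Merge the partial results into one overall result."""
    return sum(partials)

def split_compute_combine(data, chunk_size=4, workers=3):
    chunks = split_into_chunks(data, chunk_size)
    # Each worker thread plays the role of one machine holding one piece.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return combine(partials)

readings = list(range(1, 21))           # stand-in for a large data set
total = split_compute_combine(readings)
print(total)                            # 210, the same as sum(readings)
```

Because the pieces are the same size, each child computation takes about the same time, so no single worker holds up the combined result. In a real cluster the pieces would live on different machines, which is what lets the reading itself happen in parallel.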