A good working definition of Hadoop is that it is a framework of tools, libraries, and methodologies. It is also important to understand that Hadoop is open source, which means that no single party controls the source code and it is available to everybody.
To understand the motivation behind designing this framework of tools, we have to understand the concept of Big Data. These are some of the general characteristics of Big Data; there may be others, but these are generally the most important criteria:
- Big data usually consists of large and growing data files.
- They are commonly measured in terabytes (10^12 bytes) or petabytes (10^15 bytes), so these are extremely large data sizes.
- It is usually unstructured data, which means that it doesn’t fit into a relational database model as easily as one would hope.
- This data is usually derived from users, applications, systems, and sensors.
In our ever more connected world, the amount of data that we are collecting and that is made available has been growing tremendously. One of the issues faced by modern computing is the data throughput mismatch. A simple back-of-the-envelope analysis shows what that actually means. A standard spinning hard drive can deliver 60–100 MB/second when streaming (not random access). A newer solid-state drive can deliver 250–500 MB/second, and that holds for both streaming and random access because of the solid-state technology.
Essentially, though, if we look at the technology curves, hard drive speed in terms of data delivery (reading and writing) has been relatively flat. Hard drive capacity, on the other hand, has continued to grow, and right now 4 TB drives are available at the consumer level. Online data growth continues to double every 18 months, and, as we all know, processor speed and performance have been making similar gains with new processors and multi-core packages.
So, what do we find looking at these numbers? The real difficulty is moving data on and off the disk; everything else seems to be growing nicely, but moving data to and from disk is a bottleneck. For example, a 1 TB data file takes 10,000 seconds, which is about 167 minutes, to read at 100 MB/second on a spinning disk, and that assumes everything else is perfect. A solid-state disk could do the same read in 2,000 seconds, which is about 33 minutes, and that is only if we could sustain 500 MB/second.
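The arithmetic above is easy to check. A quick sketch (the throughput figures are the best-case numbers quoted above):

```python
def read_time_minutes(file_bytes, bytes_per_second):
    """Time to stream a file sequentially off a single storage device."""
    return file_bytes / bytes_per_second / 60

TB = 10**12  # terabyte
MB = 10**6   # megabyte

# Spinning disk at ~100 MB/s vs. solid-state disk at ~500 MB/s
spinning = read_time_minutes(1 * TB, 100 * MB)  # ~167 minutes
ssd = read_time_minutes(1 * TB, 500 * MB)       # ~33 minutes
print(round(spinning), round(ssd))
```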
But in the Big Data world, a terabyte-sized file is considered a “small” file. So, if a small file takes 33 minutes just to read through, what happens when you need to do some analysis on several files of this size? It is obvious how this mismatch is really going to slow things down. Of course, we can do things with RAID, more I/O controllers, and the like, but essentially we are going to hit some sort of bottleneck on a single storage device.
So, parallel data access is essential for Big Data, and that means doing things the way Hadoop and most other MapReduce systems do: they break the data up into pieces, run individual computations on each piece, and then combine the results. To overcome this bottleneck, designers came up with Hadoop and its related ecosystem.
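A minimal sketch of that split–process–combine pattern, assuming a simple word count as the analysis (the function names and the thread pool here are illustrative; Hadoop itself distributes these steps across many machines, each reading its own block of the file):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    # "Map" step: count the words in one piece of the data.
    return Counter(chunk.split())

def word_count(lines, n_workers=4):
    # Split the input into roughly equal pieces, one per worker.
    size = max(1, len(lines) // n_workers)
    chunks = [" ".join(lines[i:i + size]) for i in range(0, len(lines), size)]
    # Process each piece independently and in parallel (Hadoop would run
    # these tasks on separate machines against separate blocks of the file).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(count_words, chunks)
    # "Reduce" step: combine the partial results into one answer.
    return sum(partials, Counter())
```

Because each piece is read and processed by a different worker, the total time is bounded by the slowest piece rather than by a single device streaming the whole file.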