Comparing Hadoop with conventional databases
We have seen a couple of great examples of how Hadoop works. The next question is how Hadoop compares with conventional relational databases, which have dominated the market for years. We have already seen one big difference: in Hadoop the data is distributed across many nodes, and the processing of that data is distributed too. By contrast, in a conventional database all the data conceptually sits on one server, in one database. But there are more differences than that.
The biggest is that in Hadoop data is written once and read many times. Once you have written data you are not allowed to modify it; you can delete it, but you can't change it in place. By contrast, in a relational database data can be written many times; the balance of your bank account is an example. But with archival data, the kind Hadoop is optimized for, once you have written the data you wouldn't want to modify it anyway. Records of telephone calls or completed transactions, for example, are written once and never changed.
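The write-once-read-many idea can be sketched in a few lines of Python. This is a toy illustration of the semantics, not Hadoop code; the class and method names here (WormStore, write, read, delete) are invented for the example.

```python
class WormStore:
    """A toy key-value store with write-once-read-many (WORM) semantics:
    records can be written once and read many times; they can be deleted,
    but never modified in place."""

    def __init__(self):
        self._records = {}

    def write(self, key, value):
        # A second write to the same key is an update, which WORM forbids.
        if key in self._records:
            raise PermissionError(f"record {key!r} already written")
        self._records[key] = value

    def read(self, key):
        return self._records[key]

    def delete(self, key):
        # Deleting is allowed; only in-place modification is not.
        del self._records[key]
```

With a store like this, archiving a call record works once (`store.write("call-001", {"duration": 120})`), re-reading it works any number of times, but a second `write` to `"call-001"` raises an error, exactly the behavior described above.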
There is another difference too: with relational databases we always use SQL. Hadoop, by contrast, does not support conventional SQL natively. Instead, tools in its ecosystem offer SQL-like query languages, and the whole approach is often grouped under the NoSQL umbrella. Also, Hadoop is not a single product or platform; it is a very rich ecosystem of tools, techniques and platforms, all of which are open source and work together.
So, what is in the Hadoop ecosystem? At the lowest level, Hadoop runs on commodity clustered hardware. You don't need to buy any special hardware, and it runs on many operating systems. On top of that is the Hadoop layer itself: MapReduce and the Hadoop Distributed File System (HDFS). And on top of that are the Hadoop tools and utilities, such as RHadoop, which does statistical data processing using the R programming language. There are machine learning tools, and tools for NoSQL-style querying such as Hive and Pig. The neat thing about those tools is that they support semi-structured or unstructured data. You don't have to store your data in a conventional schema; instead you can read the data and figure out the schema as you go along.
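To make the MapReduce layer concrete, here is a minimal single-process sketch of the programming model, using the classic word-count example. This is an illustration of the idea, not Hadoop code: on a real cluster the map and reduce steps would run in parallel on many nodes, with HDFS holding the input and output.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one line of input.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: sum all the counts collected for one word.
    return (word, sum(counts))

def word_count(lines):
    # Shuffle/sort: group the intermediate pairs by key, as the framework
    # does between the map and reduce phases on a real cluster.
    pairs = sorted(kv for line in lines for kv in map_phase(line))
    return [reduce_phase(word, [c for _, c in group])
            for word, group in groupby(pairs, key=itemgetter(0))]

# word_count(["the cat", "the dog"]) -> [("cat", 1), ("dog", 1), ("the", 2)]
```

The key design point is that the programmer writes only the map and reduce functions; the distribution, grouping, and fault tolerance are the framework's job, which is what lets the same logic scale from one machine to thousands.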
Finally, there are tools for getting data into and out of the Hadoop file system, such as Sqoop. But what you have to understand is that the ecosystem is constantly evolving. For example, there is now a tool for monitoring Pig jobs called Lipstick (as in "lipstick on a pig"). Many more tools like this exist, and the environment keeps being added to all the time.