A step-by-step process for getting started with Hadoop
The question with Hadoop is how we can actually use it. The Hadoop ecosystem gives us two tools that we can play off each other: MapReduce, which can crawl through all of the data, and NoSQL, which makes quick lookups in a structured table. With NoSQL, we can't look through all the data to get the big picture, but if we know where to look, it lets us get there quickly. NoSQL also supports many more users than MapReduce does. MapReduce is much slower, but it is fundamentally more powerful analytically because it can look across far more data.
By way of comparison, databases in general are like speedboats: if we want to get from point A to point B very fast and need to carry only a few people, speedboats are great. But when we want to haul 1,000 tons of something across the ocean, we want a tugboat, and that is Hadoop. For a business, getting the broader picture means playing the two against each other. The most powerful applications, and many of the mash-ups in use now, leverage both MapReduce and NoSQL databases.
So how do we get started with Hadoop? First, we buy a Hadoop cluster and jump in with both feet. Here is a step-by-step process:
- As a first step, we put all the data we have into Hadoop, just as we would with any database.
- The next step is to do some Big Data analytics, such as MapReduce jobs or a Pig script. The basic idea is to do something that is algorithmically simple, relatively straightforward, and quite powerful. It should be something we couldn't do without looking at the data in the aggregate. From the business perspective, this second step pays the bills: we invested in Hadoop expecting to get something out of it, so we try to get a result relatively quickly, such as a visualization or some other high-level insight.
- The next step is to put all the structured, and maybe even unstructured, data into some sort of NoSQL database. Since all the raw data is already on Hadoop, we now have to make that data live: serve it up through some sort of API or a basic search backed by a NoSQL database. We aren't suggesting an exhaustive analytics engine here; it should be something simple that can answer simple questions.
- The last step, pre-computation, is where we get the really large impact. The idea is to take the output of the Big Data analytics, put it in a NoSQL database, and serve it to users.
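
The steps above can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not a real Hadoop or NoSQL deployment: the sample records, the `map_phase` and `reduce_phase` functions, and the `KeyValueStore` class are all hypothetical stand-ins. A dictionary plays the role of the NoSQL table, and a simple map-then-reduce pass plays the role of the MapReduce job.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical raw records, standing in for data already loaded into Hadoop (step 1).
RAW_EVENTS = [
    {"user": "alice", "product": "book", "amount": 12.0},
    {"user": "bob", "product": "book", "amount": 8.0},
    {"user": "alice", "product": "lamp", "amount": 30.0},
    {"user": "carol", "product": "lamp", "amount": 25.0},
]


def map_phase(records):
    """Emit one (product, amount) pair per record, as a mapper would."""
    for record in records:
        yield record["product"], record["amount"]


def reduce_phase(pairs):
    """Sort pairs by key, group them, and sum the values, as a reducer would."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


class KeyValueStore:
    """In-memory stand-in for a NoSQL table addressed by a single row key."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key, default=None):
        return self._rows.get(key, default)


def build_stores(records):
    """Run the aggregate job (step 2) and load both stores (steps 3 and 4)."""
    raw_store = KeyValueStore()      # step 3: raw data made live, keyed by user
    summary_store = KeyValueStore()  # step 4: pre-computed totals, keyed by product

    by_user = defaultdict(list)
    for record in records:
        by_user[record["user"]].append(record)
    for user, rows in by_user.items():
        raw_store.put(user, rows)

    for product, total in reduce_phase(map_phase(records)):
        summary_store.put(product, total)

    return raw_store, summary_store


raw_store, summary_store = build_stores(RAW_EVENTS)

# Each request is now a single keyed lookup; no scan over the raw data is needed.
print(summary_store.get("book"))
print(len(raw_store.get("alice")))
```

The point of the design is that the expensive pass over all the data happens once, in the batch job, while every user request afterwards is a cheap lookup by key.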