Hadoop – Some Myths Busted (7 to 10)
Fact Number 7: People often describe MapReduce as an analytic engine, but that is only partly true. MapReduce provides control for analytics, but it is not itself an analytic. If one really has to describe MapReduce, think of it as an execution engine. It can take a wide variety of hand-coded logic, typically written in languages like Java, C, or C++.
MapReduce takes hand-coded logic and runs it in parallel. That is what the name describes: it maps the logic to multiple threads (in Hadoop, parallel tasks spread across the nodes of a cluster), and that parallelism is what gets you scalability and speed. It then takes the results that come back from each thread and reduces them to a single result. So you get the idea of what mapping and reducing really is: creating threads for the logic and then consolidating the results from them. MapReduce is really great at creating a type of parallelized execution for a wide variety of hand-coded logic, which, by the way, can be anything; it doesn’t necessarily have to be analytics. It could be ETL logic, counting things for a basic report, etc.
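To make the map-then-reduce idea concrete, here is a minimal sketch in Python. It is not Hadoop's API: Python threads on a single machine stand in for the parallel tasks Hadoop would run across a cluster, and the names `map_chunk`, `merge_counts`, and `map_reduce` are hypothetical, chosen only for illustration. The hand-coded logic here is a simple word count, the kind of "counting things for a basic report" mentioned above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Hand-coded logic applied to one slice of the input: count words."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: consolidate two partial results into one."""
    merged = Counter(left)
    merged.update(right)
    return merged


def map_reduce(lines, workers=4):
    """Illustrative only: threads stand in for Hadoop's distributed tasks."""
    # Split the input into roughly equal chunks (ceiling division).
    size = max(1, -(-len(lines) // workers))
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Map: run the same logic on every chunk in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Reduce: fold the partial results into a single result.
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    sample = [
        "claim filed for water damage",
        "claim closed",
        "water damage repaired",
    ]
    print(map_reduce(sample, workers=2))
```

Swapping `map_chunk` for ETL or analytic logic leaves the surrounding structure unchanged, which is the sense in which MapReduce is an execution engine and not an analytic.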
Fact Number 8: We do hear people talking about Hadoop as a highly scalable environment for large volumes of data, and that is very true, but Hadoop is also about data diversity. HDFS is a file system, so in theory it can manage any kind of data that comes in files. That data can vary a lot. It can be structured data, like the tables generally used in data warehousing. Or it can be a mix of structured and unstructured data, generally referred to as semi-structured data, like the data that comes in XML. XML is another file type that HDFS is very good at managing.
A lot of data also comes in unstructured form, consisting largely of human language, such as what clients write about their problems and feedback, or the text collected by insurance companies, and all of it comes in files. All that adds up to a lot of data volume, but there is also a lot of data diversity. Hadoop can manage file-based data that ranges across the whole continuum from structured to semi-structured to unstructured data, with all kinds of variation in between.
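As a rough illustration of that continuum, the sketch below reads one tiny example of each kind of data using only the Python standard library. It says nothing about how HDFS stores files; the function names and the sample claim data are hypothetical and exist only to show how much structure each format gives you for free.

```python
import csv
import io
import xml.etree.ElementTree as ET


def read_structured(text):
    """Structured: fixed rows and named columns, as in a warehouse table."""
    return list(csv.DictReader(io.StringIO(text)))


def read_semi_structured(text):
    """Semi-structured: XML tags give shape, but the content is flexible."""
    root = ET.fromstring(text)
    return [
        {child.tag: child.text for child in claim}
        for claim in root.findall("claim")
    ]


def read_unstructured(text):
    """Unstructured: free text; without more work we can only tokenize it."""
    return text.lower().split()


if __name__ == "__main__":
    # Hypothetical sample data, invented for this example.
    print(read_structured("claim_id,amount\n1,250\n2,900\n"))
    print(read_semi_structured(
        "<claims><claim><id>1</id><note>water damage</note></claim></claims>"
    ))
    print(read_unstructured("Customer reports water damage in the kitchen."))
```

The further toward the unstructured end the data sits, the more of the interpretation has to be done by hand-coded logic instead of by the file format.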
Fact Number 9: Let’s move on to fact number nine, which is that Hadoop complements a warehouse; it is rarely a replacement. Very few people have actually replaced a warehouse with Hadoop. One should not expect Hadoop to replace a warehouse, and it hasn’t really replaced anything. This makes perfect sense for a variety of reasons. First of all, if you think about software portfolio management in any discipline within IT, how often do you actually decommission anything? Whether it is data warehousing or any other platform, it is rare that stuff comes out. One typically adds more stuff to gain more capability and leaves the old stuff in place, because taking it out typically means losing capabilities. So we have this long tradition of “wedge systems”, which are part of the warehouse environment but aren’t the warehouse proper. And that is how Hadoop is: it isn’t going to replace the existing systems but will become one of the wedge systems and provide additional functionality.
Fact Number 10: Hadoop enables all types of analytics, not just web analytics. If you read the IT press, they’re always talking about what the large Internet firms have done with Hadoop, and in those stories Hadoop is almost always there for analytics, and almost exclusively web analytics. So people are typically counting hits on web pages and looking at Internet Protocol locations and DNS data, and e-commerce companies do much the same. But the contribution of Hadoop isn’t limited to web analytics. Conventional companies use it too. A lot of insurance companies, for example, use Hadoop for the unstructured textual data they collect in the claims process, where there are a lot of English-language descriptions of losses, etc. We’re also beginning to see other industries, like the health care sector, using Hadoop to analyze a lot of data.
Additionally, a lot of companies are using Hadoop to bring new life to older applications; applications that require data mining, statistical analysis, etc. benefit from using Hadoop. Hadoop is also very good at helping us capture new data sources like machine data and social media data, which can bring in a lot more information about customers from channels that probably haven’t been tapped for customer data before. Thus, one can expect to find Hadoop involved in a growing range of analytics in the future.
So, these are some of the prevalent myths about Hadoop and the real facts!