Hadoop – Some Myths Busted (1 to 3)
Today many data professionals and BI people don’t know enough about Hadoop. And even those who know a lot about Hadoop have difficulty describing it to their peers and management. Even people who ought to know about Hadoop are saying things that aren’t really true. There are a lot of myths and misconceptions floating around about Hadoop and the various products that are part of the Hadoop family. So today we are going to start going through the ten most common myths about Hadoop and bust them!
Fact Number 1: We often hear people talking about Hadoop as if it were a monolithic thing. Well, that isn’t quite true; Hadoop actually consists of multiple products. It is a brand name for a family of open source products that are overseen and administered by the Apache Software Foundation. Another problem is that when people say Hadoop, they are often really thinking about just one product within the Hadoop family, the Hadoop Distributed File System (HDFS). So one has to understand that there are plenty of other products in the Hadoop family, such as MapReduce, Hive, HBase, Pig, Flume, Sqoop, etc., and not just HDFS. Hadoop was created by Doug Cutting, and if you are wondering about the name, it comes from a plush toy that was his son’s favorite. In fact, the plush toy was a yellow elephant, which is why the Hadoop logo is also a yellow elephant.
Fact Number 2: Most people think Hadoop is purely open source. Well, it is more complicated than that. Hadoop is open source and it certainly originated as open source, but it’s available from vendors as well. One can go to apache.org and download the open source versions of the Hadoop family, but you can also get some of these from vendors. These are what we call “distributions” of HDFS, and a lot of them simply redistribute open source HDFS as a convenience to customers. But there are also vendors that add extra value and functionality, and quite often add extra tools on the side, like some good tools for administering HDFS. Another advantage of going to vendors is that they provide commercial support and maintenance, which does not come with the free Apache downloads. So these are a couple of reasons why one might prefer a vendor distribution to downloading directly from the Apache website.
Fact Number 3: Now let us move on to the third fact; Hadoop is actually not a single product. It is an ecosystem made up of a wide variety of products. Again, some are open source from Apache and some are distributions from vendors. There are a lot of vendor products out there, ranging from reporting tools and BI platforms to data integration tools of various types. Even a lot of database management systems now support Hadoop. So think of it as an ecosystem; it’s not just open source, it’s not just distributions – it is all kinds of support that’s built into a growing list of vendor products.