Creation of Hadoop – Role of Google, Yahoo and Apache


Hadoop is a set of tools created to manage Big Data. So let us look, from a very high level, at what led to the creation of Hadoop. If you look at the architecture of Hadoop from a very high level, we have MapReduce, which consists of a job tracker at the master node and task trackers at the slave nodes. The file system, known as the Hadoop Distributed File System (HDFS), consists of a name node at the master computer and data nodes at the slave computers. All of this very closely resembles what Google has: Google has the Google File System, known as GFS, and it also has MapReduce. Why is that? Why is Hadoop's architecture so similar to what Google uses?


There is a story behind that. In the 1990s, how would you search for something on the web? You would choose one of your favourite search engines, which could be Excite, AltaVista, Lycos or maybe some other, and you would do a search. Or maybe you would use a tool that ran the search on all these search engines and gave you a combined result. This is what search looked like in the 1990s. Then something happened: Google came into the picture and, before we knew it, it was suddenly the number one search engine, just by word of mouth. People started using Google and telling each other that Google was the search engine that worked best for them.


So, why did that happen? How did Google achieve that kind of victory? Behind Google's victory were two pieces of technology: MapReduce and the Google File System. Google would have tons of computers at its data centres, and the Big Data that Google deals with would be broken down into smaller, equal pieces, with each piece sent to a different computer. A master node keeps the information on which data resides on which computer, and applications connect to that master node to find out where the data is residing. This was the central idea behind Google's innovation to create a distributed file system.
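The idea of a master that only remembers where each piece lives can be sketched in a few lines of Python. This is a toy illustration, not Google's or Hadoop's actual code: the `Worker` and `Master` classes, the `read_file` helper, the 8-byte piece size and the round-robin placement are all assumptions made for the example, and real systems also keep several copies of each piece, which is left out here.

```python
from itertools import cycle

BLOCK_SIZE = 8  # bytes per piece; tiny on purpose so the example is easy to follow


class Worker:
    """Hypothetical storage computer: holds pieces, knows nothing about files."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class Master:
    """Hypothetical master node: stores no data, only where each piece lives."""

    def __init__(self, workers):
        self.workers = {w.name: w for w in workers}
        self.block_map = {}  # filename -> ordered list of (block_id, worker name)

    def store(self, filename, data):
        # Break the data into equal pieces (the last one may be shorter)
        # and hand them out to the workers in round-robin order.
        targets = cycle(self.workers.values())
        locations = []
        for offset in range(0, len(data), BLOCK_SIZE):
            worker = next(targets)
            block_id = f"{filename}#{offset // BLOCK_SIZE}"
            worker.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
            locations.append((block_id, worker.name))
        self.block_map[filename] = locations

    def locate(self, filename):
        # What an application asks the master: "where is my data?"
        return list(self.block_map[filename])


def read_file(master, filename):
    # The application asks the master for locations, then collects each
    # piece from the worker that holds it and stitches the pieces together.
    return b"".join(
        master.workers[name].blocks[block_id]
        for block_id, name in master.locate(filename)
    )


master = Master([Worker("node-1"), Worker("node-2"), Worker("node-3")])
master.store("log.txt", b"hello distributed world!")
print(master.locate("log.txt"))
print(read_file(master, "log.txt"))
```

Notice that the master never touches the bytes after handing them out; it only answers the question of which computer holds which piece, and that is what keeps it small enough to sit on a single machine.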


Dividing the data into smaller pieces and sending each piece to a different computer is only part of the story; the real strength comes from the tool they developed, called MapReduce. As the name suggests, MapReduce takes a high-level task or computation, breaks it down into pieces, and sends each piece to the same computer where the corresponding piece of data is sitting. So now, instead of performing one big computation on one big data set, smaller computations are performed on each smaller piece of data. Finally, the results of the smaller computations are gathered and aggregated, and this is the answer to the query. This results in much faster performance compared to dealing with the big data directly. So, this was the technology and innovation that Google came up with to win the battle of the search engines.
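To make the "small computations, then aggregate" idea concrete, here is a minimal Python sketch that counts words across several pieces of text. It is only an illustration under stated assumptions: the function names `map_piece`, `reduce_counts` and `word_count` are made up for this example, word counting is just a convenient task, and everything runs in one process, whereas a real MapReduce system would run each map step on the computer that holds that piece.

```python
from collections import Counter
from functools import reduce


def map_piece(piece):
    # The small computation: count the words in ONE piece of data only.
    return Counter(piece.split())


def reduce_counts(left, right):
    # The aggregation step: merge two partial results into one.
    return left + right


def word_count(pieces):
    # Run the small computation on every piece, then gather the results.
    partial_results = [map_piece(piece) for piece in pieces]
    return reduce(reduce_counts, partial_results, Counter())


# Three pieces standing in for data that lives on three different computers.
pieces = ["big data big", "data is big", "hadoop"]
totals = word_count(pieces)
print(totals["big"], totals["data"], totals["hadoop"])  # prints: 3 2 1
```

Because each call to `map_piece` looks at its own piece and nothing else, the calls do not depend on one another, which is what allows them to run on many computers at the same time.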


In 2003, Google released a paper on the Google File System, and this was the first time they told the world how they were storing data using GFS. Knowing about GFS was only part of the story, and it did not quench the thirst, because the world did not yet know how Google was dealing with the computations. In 2004, Google released another paper, and in it they told the world about MapReduce, which they used to perform processing on Big Data. This was the first time the world came to realize how Google was doing what it was doing.


Doug Cutting and Michael Cafarella got very interested in the papers that Google released in 2003 and 2004. At the time they were building an open-source web search engine called Nutch, and to support that project they needed technology similar to what Google was using. They started reading the papers and building something based on them, and Hadoop was born as a result. So Hadoop was born around 2005, but the technology behind it was created well before that by Google and released to the world in 2003 and 2004. Yahoo came into the picture in 2006, when Doug Cutting joined the company and Yahoo put its weight behind the work. What Doug Cutting and Michael Cafarella created out of those papers was named Hadoop.


The name Hadoop is indeed strange. It was the name of a toy elephant that Doug's son used to play with; in fact, it was Doug's son who invented the name and gave it to the toy elephant, and Doug borrowed it for his project. In 2006, Hadoop was split out of Nutch into its own project at Apache, with Yahoo backing its development, and that is how Apache came into the picture. So, the idea was invented by Google, and the original product was created by Google as well; based on Google's papers, Hadoop was created, built up with Yahoo's support, and developed under Apache.