Parts that make up Hadoop, its applications and future prospects
So, drilling in with a little more depth: as we mentioned, Hadoop is a set of Apache frameworks, but it also includes more. Now, as a point of information, you could run Hadoop, as some people do, with just the binaries. These consist of the data storage layer, HDFS, which is also called the Hadoop core, running on commodity hardware, usually on the Linux operating system; of course, this scales horizontally, as we've been mentioning. Hadoop will also work on some other operating systems depending on the distribution; Microsoft has actually worked with Hortonworks on a port to Windows, although it is still in private beta. The most common operating system for HDFS at this time is Linux.
Now, on top of the Hadoop core layer, the binaries include the MapReduce API, which allows job-based, parallelizable processing across data in the HDFS structures. MapReduce also has fault tolerance built into its algorithm execution: by default, an HDFS cluster stores each piece of data three times, and the MapReduce algorithm expects that there will be hardware failures on this commodity hardware, so it has automatic retry built in. There are many knobs and widgets you can tune, so to speak, in the MapReduce algorithm and the server configuration.
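To make the programming model concrete, here is a minimal sketch of the map/reduce pattern in plain Python. This is not Hadoop's actual Java API; the function names (`map_phase`, `reduce_phase`, `run_job`) are ours, chosen for illustration, and the "shuffle" is just an in-memory grouping rather than a distributed sort.

```python
from collections import defaultdict

def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in record.split()]

def reduce_phase(key, values):
    """Reduce step: sum the counts emitted for a single word."""
    return (key, sum(values))

def run_job(records):
    # Shuffle step: group all mapped pairs by key. In Hadoop this grouping
    # happens across the cluster; here it is a simple in-memory dictionary.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    # Each reduce call depends only on its own key's values, which is what
    # makes the work parallelizable and retryable per task.
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = run_job(["the quick brown fox", "the lazy dog"])
# counts["the"] == 2; every other word appears once
```

Because each map and reduce task is independent, a framework like Hadoop can simply rerun a failed task on another node, which is the automatic retry described above.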
Now, sitting on top of the core in most commercial distributions of Hadoop are other tools and frameworks, which can improve usability and usefulness in your particular business situation. Those come in the form of libraries and tools; at the data access layer, the most common ones that we see are HBase, Hive, Pig, and Mahout.
On top of that, some of the vendors provide additional higher-level tools and libraries. Hue from Cloudera, for example, is a commonly used tool; it's a graphical user interface which makes working with Hadoop a little simpler when you're first starting. Now, most commonly when you see the Apache distribution, you see people working directly from the Hadoop command line, but we tend to use GUIs to introduce topics when they're available because they're just a little quicker. Sqoop is a library of particular interest because it allows interoperability between relational databases, in particular SQL Server, and HDFS.
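As a sketch of what that interoperability looks like, a Sqoop import of a relational table into HDFS is a single command line. The host name, database, credentials, table name, and target directory below are all placeholders, not values from any real environment:

```shell
# Hypothetical example: pull one SQL Server table into HDFS with Sqoop.
# All connection details here are made-up placeholders.
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=Sales" \
  --username etl_user -P \
  --table Orders \
  --target-dir /data/sales/orders \
  --num-mappers 4
```

Under the covers, Sqoop turns this into a MapReduce job (here, four parallel mappers) that reads slices of the table and writes them as files in HDFS.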
And then, sitting on top of most implementations of Hadoop are some monitoring and alerting tools. These are usually ones that you purchase from various vendors; they can come from a Hadoop vendor such as Cloudera, MapR, or Hortonworks, or from an integration vendor such as Greenplum. There are also some multi-cloud vendors providing data solutions across different vendor clouds, such as RightScale, that have tools as well.
There are many companies that are using Hadoop and have been using it for years. The most commonly known is Facebook, and a lot of Hadoop access technologies were actually developed by Facebook; the other very well-known user of Hadoop is Yahoo. In fact, one of the major commercial distributions of Hadoop, Hortonworks, comes from a group of former Yahoo employees who are building tools and services on top of the Apache distribution. Other companies using Hadoop are Amazon, eBay, American Airlines, the New York Times, the Federal Reserve Board, IBM, and Orbitz. Apart from these American companies, there are many companies in Europe and other parts of the world that are also using Hadoop.
Another important consideration when we're taking the time to learn about the new technologies that Hadoop encompasses is our employability. Big Data, and the methods of working with larger and larger sets of data, is a very hot topic in the data world, and this improves the employment prospects of people with Hadoop skills. Those skills include Hadoop administration, which really won't be covered in any depth in this introduction; but probably the most employable skill for Hadoop is the ability to appropriately write and work with MapReduce jobs.