+91 70951 67689 datalabs.training@gmail.com

Hadoop and Yarn – Ways to utilize idle nodes in a cluster


Here we are going to learn how to make better use of a Hadoop cluster using Yarn. So, we have Hadoop, an open-source framework that we use to analyze our Big Data. We have to pay for the MapReduce cluster, which may be running on Amazon, and in a typical MapReduce cluster there are a lot of nodes that go unused; well, who gets 100% utilization these days, anyway? So these nodes are idle; what can we do with them? Maybe we could run something more than MapReduce, because if we want to do some real-time analysis we can't do it with MapReduce. We could go for something else like Storm or some other real-time streaming engine, or we could run a message passing algorithm over this data.


So, what can we do with the idle nodes that are available in the cluster? A Message Passing App is something one can go for; this too is a distributed application. It runs in the cluster as several agents; all together, the states of the agents make up the state of the application, and the agents send messages to one another and manipulate the data. This is a pretty common pattern for message passing applications.
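To make the pattern concrete, here is a minimal single-machine sketch of it: a few agents arranged in a ring, each holding its own piece of state and passing messages to its neighbour until a limit is reached. The class and function names are purely illustrative, not the API of any real message passing framework, and in a real cluster the queues would be network channels between nodes rather than in-process queues.

```python
import threading
import queue

LIMIT = 10  # stop passing the token once it reaches this value


class Agent(threading.Thread):
    """One agent in the ring; its `state` is its share of the app's state."""

    def __init__(self, inbox, outbox):
        super().__init__()
        self.inbox = inbox      # queue this agent reads messages from
        self.outbox = outbox    # queue of the next agent in the ring
        self.state = 0

    def run(self):
        while True:
            token = self.inbox.get()
            if token is None or token >= LIMIT:
                self.outbox.put(None)   # propagate shutdown to the next agent
                break
            self.state += token          # manipulate local state
            self.outbox.put(token + 1)   # send a message to the next agent


def run_ring(n_agents=3):
    queues = [queue.Queue() for _ in range(n_agents)]
    agents = [Agent(queues[i], queues[(i + 1) % n_agents])
              for i in range(n_agents)]
    for a in agents:
        a.start()
    queues[0].put(1)   # seed the first message
    for a in agents:
        a.join()
    # together, the agents' states make up the state of the application
    return [a.state for a in agents]


print(run_ring())  # [12, 15, 18] with LIMIT = 10
```

The ring topology is just one choice; the essential point is the same as in the text: no single node holds the whole application state, and progress is made only by exchanging messages.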


Another type of application we could run is a streaming app: real-time events come into the system and are processed in a pipeline of steps. The pipeline can branch if required, but in general it runs in a distributed fashion and is scalable.
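A toy sketch of that shape, with a shared upstream step feeding two branches, might look like the following. In production each step would run distributed (for example as a Storm topology); here every step is just a function, and all the event fields and step names are made up for illustration.

```python
def parse(event):
    """Upstream step: turn a raw "user:action" event string into a record."""
    user, action = event.split(":")
    return {"user": user, "action": action}


def count_actions(records):
    """Branch A: aggregate a running count per action type."""
    counts = {}
    for r in records:
        counts[r["action"]] = counts.get(r["action"], 0) + 1
    return counts


def alert_on(records, action="error"):
    """Branch B: flag the users who produced a given action."""
    return [r["user"] for r in records if r["action"] == action]


def run_pipeline(events):
    records = [parse(e) for e in events]              # shared parse step
    return count_actions(records), alert_on(records)  # pipeline branches


events = ["alice:click", "bob:error", "alice:click", "carol:buy"]
print(run_pipeline(events))
```

Running it yields the per-action counts from one branch and the alert list from the other, which is exactly the branching-pipeline pattern the text describes.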


Alternatively, we can do something very simple. Now that we have developed a web app, an important thing to do is load testing. We can run a distributed load test, which is very simple: we run n nodes that all do the same thing; they just hammer the web service. This is quite easy to do if one has spare compute resources.
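The idea is simple enough to sketch on one machine with threads standing in for nodes. The request itself is stubbed out as a local function so the sketch is self-contained; in a real load test each worker would issue an HTTP request against the web app under test (e.g. with `urllib.request.urlopen`). All names here are illustrative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

hits = 0
lock = threading.Lock()


def fake_request():
    """Stand-in for one HTTP request to the service under test."""
    global hits
    with lock:          # count hits safely across worker threads
        hits += 1
    return 200          # pretend status code


def load_test(n_workers=8, requests_per_worker=50):
    """Run n workers that all do the same thing: hammer the service."""
    def worker():
        return [fake_request() for _ in range(requests_per_worker)]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(worker) for _ in range(n_workers)]
        statuses = [s for f in futures for s in f.result()]
    return len(statuses)


print(load_test())  # 400 requests issued in total
```

Scaling this out across the idle nodes of a cluster is mostly a scheduling problem, which is exactly the kind of thing a resource manager like Yarn handles.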


And all these things are being done by a number of companies globally. For example, a company called Continuity has built a Big Data platform that allows users to run all the above applications, and much more, in a single cluster, which makes it really easy. Right now they are running things like real-time stream processing, ad-hoc queries, MapReduce jobs, etc. in one Hadoop cluster. In doing all this, they found they wanted a multi-purpose cluster where all these different types of applications can run at the same time and co-exist with each other. And they have made this possible with something called Yarn.