YARN – Yet Another Resource Negotiator
So, what is YARN? When Hadoop was first written, it was solely a MapReduce engine. Because it has to run on a cluster, it also had a cluster management component, but that component was tightly coupled to the MapReduce programming paradigm, so the only thing anybody could run was a MapReduce job, which is obviously limiting. That changed in Hadoop 2.0: the resource management of the cluster has been decoupled from the application management, which is the part that actually understands the programming paradigm being run. The good outcome is that we can now run different types of applications on a Hadoop cluster.
YARN, just like the rest of Hadoop, is written entirely in Java. It is multi-tenant and has security, so one can run different clients' jobs in the same cluster without them impacting each other. Finally, it scales to many thousands of nodes, although most people don't need that many. People who run Hadoop typically start with a dozen nodes and maybe upgrade to a hundred after a year, so very few go to a thousand. Yes, there are companies like Google, Yahoo, Twitter and Facebook that run thousands of nodes because they have really big data. YARN works as well on a 12-node cluster as it does on a 5000-node cluster.
YARN stands for Yet Another Resource Negotiator. This is how it works: it adds one machine to the cluster, the YARN resource manager, which knows about all the resources in the cluster. Every application then adds one component of its own, the application master. The application master communicates with the resource manager to obtain resources from the cluster. Once it has these resources, it can start tasks on them and control the application.
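The exchange between the application master and the resource manager can be sketched as a toy model in Python. This is not the YARN API (the real interfaces are in Java). The names ResourceManager, ApplicationMaster, request and run are hypothetical, chosen only to mirror the roles described above, and resources are reduced to a simple count of containers.

```python
class ResourceManager:
    """Toy stand-in for the resource manager: it knows the cluster's free capacity."""

    def __init__(self, total_containers):
        self.free = total_containers

    def request(self, wanted):
        # Grant as many containers as are free, up to the number asked for.
        granted = min(wanted, self.free)
        self.free -= granted
        return granted


class ApplicationMaster:
    """Toy per-application master: obtains resources, then starts tasks on them."""

    def __init__(self, name, resource_manager):
        self.name = name
        self.rm = resource_manager
        self.tasks = []

    def run(self, task_names):
        granted = self.rm.request(len(task_names))
        # Only as many tasks start as containers were granted.
        self.tasks = list(task_names[:granted])
        return self.tasks


# Two different applications share one cluster of four containers.
rm = ResourceManager(total_containers=4)
wordcount = ApplicationMaster("wordcount", rm)
graph_job = ApplicationMaster("graph-job", rm)

started_a = wordcount.run(["map-0", "map-1", "reduce-0"])
started_b = graph_job.run(["step-0", "step-1"])
```

With four containers, the first application gets the three it asks for and the second gets only the one that is left. This toy simply leaves the remaining task unstarted.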
Here is how YARN hands out resources: every machine in the cluster has memory and CPU, but YARN takes a slightly more simplified approach. Each of these computing machines has a number of compute slots, and a slot is a container in which one can run tasks. Each machine runs a process (the node manager) that manages the slots on that machine. One can run any job or task in a slot; one can even run any shell command there. It is the job of the resource manager to assign slots to applications. The application's master asks the resource manager for slots; once it has them, it can start tasks there, and the slots stay under its control until it releases them.
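The slot lifecycle (assign, hold, release) can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not how YARN is implemented: Node, SlotManager, assign and release are hypothetical names, and slots are handed out first-come, first-served with no scheduling policy.

```python
class Node:
    """One machine with a fixed number of slots (containers)."""

    def __init__(self, name, slots):
        self.name = name
        # owner[i] is the application holding slot i, or None if the slot is free.
        self.owner = [None] * slots

    def free_slots(self):
        return [i for i, app in enumerate(self.owner) if app is None]


class SlotManager:
    """Toy resource manager that hands out (node, slot) pairs to applications."""

    def __init__(self, nodes):
        self.nodes = nodes

    def assign(self, app, count):
        granted = []
        for node in self.nodes:
            for i in node.free_slots():
                if len(granted) == count:
                    return granted
                node.owner[i] = app
                granted.append((node.name, i))
        return granted

    def release(self, app):
        # The application gives back every slot it holds.
        for node in self.nodes:
            node.owner = [None if o == app else o for o in node.owner]


manager = SlotManager([Node("a", 2), Node("b", 2)])

first = manager.assign("app-1", 3)        # app-1 takes three of the four slots
second = manager.assign("app-2", 2)       # only one slot is left for app-2
manager.release("app-1")                  # app-1 is done and releases its slots
third = manager.assign("app-2", 1)        # a freed slot can now go to app-2
```

The point of the sketch is that a slot belongs to exactly one application from the moment it is assigned until that application releases it, after which the manager can give it to someone else.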