How Does Yarn work in a Hadoop cluster?
Here are more details on how Yarn works:
All the pink blocks in the schematic above indicate Yarn. There are four protocols involved here:
- We have a client, which might be our laptop, and from this laptop we want to submit an application. The first thing we have to do is start the application master, and for that we use the client-resource manager protocol.
- Once the master is running, it communicates with the resource manager to request resources. That is the application master-resource manager protocol.
- A node manager runs on every node, and the application master cannot start a job in a slot by itself. It has to submit the request to the node manager, and the node manager starts the job on behalf of the application master. And that is how it works. That looks quite simple, right?
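As a hedged sketch of the second and third protocols from the application master's side, Hadoop ships helper classes (`AMRMClient` for talking to the resource manager, `NMClient` for talking to node managers) in the `org.apache.hadoop.yarn.client.api` package. The resource sizes and the `echo` command below are made-up illustration values, and this needs a running cluster:

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AppMasterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();

        // Protocol 2: application master <-> resource manager
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, ""); // host, RPC port, tracking URL

        // Ask the resource manager for one worker container (sizes are made up)
        Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcores */);
        rmClient.addContainerRequest(
                new ContainerRequest(capability, null, null, Priority.newInstance(0)));

        // Protocol 3: application master -> node manager
        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(conf);
        nmClient.start();

        // Poll until the RM allocates a container, then ask the NM to launch it
        while (true) {
            AllocateResponse response = rmClient.allocate(0.1f);
            for (Container container : response.getAllocatedContainers()) {
                ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
                        null, null, Collections.singletonList("echo hello"),
                        null, null, null);
                nmClient.startContainer(container, ctx); // NM starts it on our behalf
                rmClient.unregisterApplicationMaster(
                        FinalApplicationStatus.SUCCEEDED, "", "");
                return;
            }
            Thread.sleep(1000);
        }
    }
}
```

Note how the master never starts the worker process itself: it only hands the resource manager a request and the node manager a launch context, matching the division of labour in the bullets above.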
And this is how all that is done in a step-by-step manner. The first step is to start the application master in one of the slots. To understand this you have to remember that all of this runs in a Hadoop cluster, and Hadoop has a distributed file system called HDFS. All the data that jobs in the cluster can share lives in this distributed file system. The way it works is that every node has a local file system and contributes a part of it to the distributed file system. When a job runs locally on a node, it sees only the local file system through the native file system protocols. If we want to see files that are on the distributed file system, we need to go through the HDFS protocol, a different file system protocol/client. So an important implication here is that when we run a job, it can depend only on files that are on the local file system, because it cannot understand HDFS by itself.
So now we want to start an application master in the cluster, and the client has the job file with all the classes on its local file system. If we simply say, "Yarn, start this class on one of the nodes", that slot in the cluster cannot access the local file system of the client. So in some way we have to copy the job file into the local file system of that compute slot. The way to do this is that, before submitting the job, the client uses the HDFS protocol to copy the job file to the distributed file system. It then sends a request to the resource manager and points it to the location on HDFS. The resource manager tells the node manager that it wants to start this job, the node manager copies the job file from the distributed file system to the local file system, and so the job can be started. So you can see that it is a bit more complicated than what one expects at first.
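Concretely, this staging step can be done from the client with Hadoop's `FileSystem` API, which speaks the HDFS protocol. The file name and staging path below are made-up examples:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageJobFile {
    public static void main(String[] args) throws Exception {
        // The client reaches HDFS through Hadoop's FileSystem abstraction.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy the job file from the client's local file system into HDFS,
        // so a node manager can later copy it into the compute slot.
        Path local = new Path("myapp.jar");              // made-up local job file
        Path staged = new Path("/apps/myapp/myapp.jar"); // made-up staging path
        fs.copyFromLocalFile(local, staged);
        System.out.println("Staged at " + fs.getFileStatus(staged).getPath());
    }
}
```

The request sent to the resource manager then references the staged HDFS path, not the client's local path.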
When writing a Yarn client, what are the steps we have to follow? To start the application master, we first have to connect to the resource manager. Step zero, though, is to copy the job file to HDFS. Next,
Step 1: We connect to the resource manager
Step 2: Then we request a new application ID; from now on, whenever we talk to the resource manager we have to use that particular ID.
Step 3: Then we create a submission context that identifies the application, and within it a container launch context that describes the container, and its requirements, in which we want to run the master.
Step 4: Then in that context we define the local resources: the job files and other resources that need to be copied to the local file system.
Step 5: Then we define the environment; the environment variables have to be set.
Step 6: Then we define the command; the command is actually a shell command.
Step 7: Then we give it resource limits; we can say this particular container may only use 2 GB of memory, and if it goes above that, kill it.
Step 8: Once all that is done, we can submit the request to start the application master.
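The steps above can be sketched with Hadoop's `YarnClient` API. This is a hedged outline rather than a complete client: the jar name, the HDFS staging path, the main class, and the classpath value are made-up illustration values, error handling is omitted, and it needs a running cluster:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // Step 1: connect to the resource manager
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Step 2: request a new application ID
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        ApplicationId appId = appContext.getApplicationId();
        appContext.setApplicationName("myapp");             // made-up name

        // Step 4: local resources -- the job file copied to HDFS in step zero
        FileSystem fs = FileSystem.get(conf);
        Path jarOnHdfs = new Path("/apps/myapp/myapp.jar"); // made-up staging path
        FileStatus jarStatus = fs.getFileStatus(jarOnHdfs);
        LocalResource appJar = LocalResource.newInstance(
                ConverterUtils.getYarnUrlFromPath(jarOnHdfs),
                LocalResourceType.FILE, LocalResourceVisibility.APPLICATION,
                jarStatus.getLen(), jarStatus.getModificationTime());

        // Step 5: environment variables for the master's container
        Map<String, String> env = new HashMap<>();
        env.put("CLASSPATH", "./myapp.jar");                // made-up classpath

        // Step 6: the command is an ordinary shell command
        String command = "java com.example.MyAppMaster"     // made-up main class
                + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
                + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr";

        // Step 3: the launch context describing the master's container
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.singletonMap("myapp.jar", appJar), env,
                Collections.singletonList(command), null, null, null);
        appContext.setAMContainerSpec(amContainer);

        // Step 7: resource limit -- kill the container above 2 GB of memory
        appContext.setResource(Resource.newInstance(2048, 1));

        // Step 8: submit the request to start the application master
        yarnClient.submitApplication(appContext);
        System.out.println("Submitted " + appId);
    }
}
```

Note that the local resource entry only *points* at the HDFS copy of the jar; the node manager that eventually hosts the master does the actual download into the slot's local file system, exactly as described above.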
Thus, all these steps should be taken care of when writing the code.