Components of Apache Flume
Apache Flume is part of the Hadoop ecosystem and is responsible for continuously moving data into HDFS. Flume is made up of several components, namely:
- Source
- Channel
- Sink
- Agent
- Event
- Channel selectors
- Sink processors
We need to understand each of these components to understand Flume in detail. Here is a pictorial representation of how some of the important components of Flume work together:
Suppose an application server is producing the data; this is the data generation area. Below the data generation area comes the Flume part, followed by the HDFS part, where Hadoop resides. The generated data is first sent to the Source, which is a part of Apache Flume. The Source is therefore the component of Flume that receives data, and we can consider it the input of Apache Flume. Alongside the Source there is another component called the Sink, which writes data to HDFS. It can also write to an HBase database, so both HDFS and HBase are supported.
Hence, in Apache Flume the Source is the input part and the Sink is the output part. Between the input and the output there is another component called the Channel, which acts as the glue between the Source and the Sink. All three components, namely Source, Sink and Channel, run inside a daemon process called the Agent. So the Agent is responsible for running all three components.
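To make the relationship between the four pieces concrete, here is a minimal Python sketch of an Agent that owns one Source, one Channel and one Sink. This is only a toy model for illustration and not Flume's actual API (Flume itself is written in Java). The class names, the method names `receive`, `put`, `take` and `drain`, and the plain list standing in for HDFS are all assumptions made for this example.

```python
from collections import deque


class Channel:
    """Holding area between a source and a sink (a toy in-memory queue)."""

    def __init__(self):
        self._queue = deque()

    def put(self, item):
        self._queue.append(item)

    def take(self):
        # Return the oldest item, or None when the channel is empty.
        return self._queue.popleft() if self._queue else None


class Source:
    """Input side: receives data and writes it to a channel."""

    def __init__(self, channel):
        self.channel = channel

    def receive(self, data):
        self.channel.put(data)


class Sink:
    """Output side: drains a channel and writes to a destination."""

    def __init__(self, channel, destination):
        self.channel = channel
        self.destination = destination  # a list standing in for HDFS

    def drain(self):
        while (item := self.channel.take()) is not None:
            self.destination.append(item)


class Agent:
    """Container that owns the three components and wires them together."""

    def __init__(self, destination):
        self.channel = Channel()
        self.source = Source(self.channel)
        self.sink = Sink(self.channel, destination)


hdfs_stand_in = []  # hypothetical destination, not a real HDFS client
agent = Agent(hdfs_stand_in)
agent.source.receive("log line 1")
agent.source.receive("log line 2")
agent.sink.drain()
print(hdfs_stand_in)  # ['log line 1', 'log line 2']
```

Note that the Sink never talks to the Source; the only thing the two share is the Channel that the Agent hands to both of them.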
As we have seen, data is transferred from the generation centre first to the Source. From the Source it passes through the Channel to the Sink before ultimately going into HDFS. During the transfer from the Source to the Sink, each unit of data is known as an Event, and since this happens inside Flume it can be called a Flume Event. There will therefore be a number of Flume Events; these are the actual payload, because they contain the data.
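An Event can be pictured as a small record that wraps one unit of data. The sketch below is a rough Python stand-in rather than Flume's own Event class. It assumes an Event carries a byte payload (the body) plus an optional set of headers, which is how Flume models events in general, although headers are not discussed in this article. The names `FlumeEvent` and `to_events`, the `host` header and the sample log lines are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FlumeEvent:
    """Toy stand-in for a Flume event: a byte payload plus optional headers."""

    body: bytes
    headers: dict = field(default_factory=dict)


def to_events(lines, host):
    """Wrap each raw log line in its own event, tagging where it came from."""
    return [
        FlumeEvent(body=line.encode("utf-8"), headers={"host": host})
        for line in lines
    ]


events = to_events(["GET /index.html", "POST /login"], host="app-server-1")
print(len(events))     # 2
print(events[0].body)  # b'GET /index.html'
```

Each log line becomes exactly one Event, which is why a busy application server produces a large number of Flume Events.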
Another important point is that an Agent is not limited to a single Channel; there can be multiple Channels. The Source can write data, in the form of Events, to one Channel or to more than one, depending on how we configure it. The Channel is the holding area where Events are stored before they are passed to the Sink. The Sink processes Events only through the Channel; it cannot take an Event directly from the Source. So an Agent can have multiple Sources, multiple Channels and multiple Sinks.
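The multi-channel case can be sketched in the same toy style. Here one Source is configured with two Channels and copies every Event to both, while each Sink drains only its own Channel. Copying to every Channel is just one possible behaviour (in Flume the channel selector decides which Channels receive an Event), and the names `c1`, `c2`, `hdfs_sink` and `hbase_sink` are assumptions made for this example.

```python
from collections import deque


class Channel:
    """Named holding area for events."""

    def __init__(self, name):
        self.name = name
        self.events = deque()


class Source:
    """Writes every event to each channel it is configured with."""

    def __init__(self, channels):
        self.channels = list(channels)

    def receive(self, event):
        for channel in self.channels:
            channel.events.append(event)


class Sink:
    """Reads only from its own channel, never from the source."""

    def __init__(self, channel):
        self.channel = channel
        self.written = []

    def drain(self):
        while self.channel.events:
            self.written.append(self.channel.events.popleft())


hdfs_channel = Channel("c1")   # hypothetical channel feeding an HDFS sink
hbase_channel = Channel("c2")  # hypothetical channel feeding an HBase sink
source = Source([hdfs_channel, hbase_channel])
hdfs_sink = Sink(hdfs_channel)
hbase_sink = Sink(hbase_channel)

for event in ("event-1", "event-2"):
    source.receive(event)

hdfs_sink.drain()
print(hdfs_sink.written)          # ['event-1', 'event-2']
print(len(hbase_channel.events))  # 2, still held until its own sink drains it
```

Because the second Channel keeps its copy of the Events until its own Sink drains it, a slow Sink on one path does not stop the other path from delivering.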