Apache Flume – What It Is and Where We Use It
What is Flume?
“Flume is a distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of streaming event data.”
Now, where do we use Flume? Consider a website where all the server logs are collected in one place. We want to analyse those logs to learn about our visitors: who they are, what kind of activity they perform on the site, and so on. Based on that analysis, we can change features on the website to attract more traffic.
For example, by analysing the log files we might find that many people are unable to locate a particular call-to-action, such as an “enrol”, “buy” or “try” button. Once we know where the problem lies, we can take steps to rectify it. Similarly, the logs can tell us where users go on the site and what their typical behaviour is.
All such information is recorded and stored in our server logs. As website owners, what we want to do is pull this data into HDFS before running any analysis, because the data will obviously be huge and analysing it without Hadoop would be very difficult.
So the idea is this: just as we use Sqoop to move RDBMS data into HDFS, we use Flume to move streaming data (event data, weblogs and the like) into HDFS.
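To make this concrete, a Flume agent is typically defined in a properties file that wires together a source, a channel, and a sink. Below is a minimal sketch of an agent that tails a web server's access log and writes the events into HDFS. The agent name (`agent1`), log path, and HDFS URL are illustrative assumptions, not values from this article:

```
# Hypothetical Flume agent "agent1": names and paths here are
# illustrative; adapt them to your own environment.
agent1.sources  = weblog-source
agent1.channels = mem-channel
agent1.sinks    = hdfs-sink

# Source: tail the web server's access log via an exec source
agent1.sources.weblog-source.type = exec
agent1.sources.weblog-source.command = tail -F /var/log/httpd/access_log
agent1.sources.weblog-source.channels = mem-channel

# Channel: buffer events in memory between source and sink
agent1.channels.mem-channel.type = memory
agent1.channels.mem-channel.capacity = 10000

# Sink: write the events into HDFS, bucketed by date
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = mem-channel
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/logs/%Y-%m-%d
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
```

Such an agent would then be started with the `flume-ng` command, for example: `flume-ng agent --conf conf --conf-file weblog.conf --name agent1`.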
Tags: Apache Flume and Hadoop