File compression in Hadoop – Problems and solutions
Organizations today are being faced with increasing volumes of data and also velocity of data as well. If you consider some that volumes of data companies are dealing with, they have to deal billions of records at day level. They need to be able to manage that data and be able to run analytics against. Now one of the new technologies or the new approaches for dealing with this deluge of data is data compression. It can lead to a reduction in hardware storage footprint required to actually store that data and by turning I/O band problems into CPU band problems it can lead to an improvement of efficiency with which you can analyze that data as well.
Organizations today typically compress data in Hadoop in either one of two ways; they may use LZO forms of compression or they may just use official gzip. So LZO is often used in Hadoop both as an input into Hadoop but also as a way of dealing with an intermediate result set. Now between LZO and gzip you can typically see between three to seven x compressions. When you balance that against the kinds of replication factors that are involved in Hadoop, where you may have a 3x level of replication within your Hadoop HDFS system, it can quite often counteract any benefit from compression directly. Our compression is much more aggressive than gzip and LCO. And we go beyond these techniques by using deduplication of individual values within records to go beyond the three to seven x you might seem traditional technologies and moving that compression factor from 10 to 20, 30 to 40 x.
Here we would like to explain how Rainstorm compresses and stores data on an existing Hadoop cluster. So, we might start off with a couple of nodes: node number one and node number two and we might have a data source. And then course in Hadoop cluster each of these nodes has an attached disk for storing their file objects and it’s connected to a network. So, quite often in Hadoop you’re dealing with CSV files; comma-separated files they may contain the raw data and typically what might be happening is that a CSV file is being sent to each node and say for example this CSV data might consist of records on a Telco network, for example, that you want to store within a Hadoop cluster and analyze at a later point. So, you may have large volumes of these CSV files being streamed to a Hadoop cluster and they may be compressed on the way and then stored as gzip files within Hadoop storage. Of course then we can lay MapReduce tasks to the data that is stored within nodes within the HDFS system and performance analytics going forward.
So, one of the problems with this system is the level of compression you get out of this system. Quite often, when we performed experiments using this kind of set up, we found that the compression just wasn’t high enough. And a lot of the architectural and hardware that’s involved in this system is actually being used to shuffle and marshal these files around as part of the MapReduce tasks. So, what’s happening is this problem becomes an I/O band bomb problem rather than a CPU band problem. So, the question is how can I improve the efficiency of my nodes when I’m dealing with what is effectively quite a low level of compression using traditional byte level compression.
This is a very interesting point because when you get some very large Hadoop clusters you find actually the CPU utilization can be as low as 10 percent. So when you’re rolling out more and more nodes effectively they are just being used is dumb storage blocks. So basically you are just adding them to your Hadoop cluster to increase your storage footprint and you are getting very little benefit out of it from the additional CPU that accompanies those nodes. So what we do is apply a deep level of compression to these individual file objects prior to them being analyzed within MapReduce jobs.
So let’s have a look at what happens, in this case: say a CSV file might come through into an individual node; then instead of a CSV file been compressed on the disk we have proprietary tree format files. So, for every CSV file you may get one or more compressed tree file being generated. Now these tree files have four levels a compression: The first level of compression is value level compression and that’s similar to what you see from columnar databases. We capture a unique set of values within a particular field; we don’t store all the values for every single record – we only store the unique ones; so we have a column level of deduplication there.
The next one is called pattern level deduplication: now this goes beyond identifying common values within different fields and looks for pairs of values and pair of pairs of values that are common to different records. Now this is something that traditional technologies in a data warehouse industry or indeed any compression technology that’s been applied to Hadoop, so far, can do. So this pushes up compression from may be 5 to 10 in a valuable level scenario, up to 20-30 times compression through the application and the identification of patterns.
Next we have arithmetic compression; At this point when we are persisting these tree objects onto disks we apply a number of different tips and tricks to allow us to reduce the overhead that we use when identifying and tracking these values and pattern levels of deduplication. Now finally we can optionally apply the byte level of compression. So it’s exactly the same source of technology you’d see in gzip.