Hadoop MapReduce – The initial preparation steps
Today we are going to explore the world of MapReduce, but before we get into programming MapReduce we need to talk about the preparation steps that are commonly taken. Because MapReduce operates on very large amounts of data, we need to discuss in a little more detail what considerations apply to the data you're going to be working with in your MapReduce job or jobs. We'd like to remind you that the underlying structure of the HDFS file system is very different from traditional file systems in that the block sizes are quite a bit larger; the actual block size for your cluster depends on the cluster configuration. Currently, the common options are 64 and 128 MB, so you have to think about where your files are going to be partitioned and whether you need to do any custom partitioning based on those block sizes.
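To make the block-size point concrete, here is a rough back-of-the-envelope sketch (this is not Hadoop's actual input-split logic, just simple arithmetic) showing how the configured block size determines how many blocks, and therefore roughly how many default map tasks, a file will occupy:

```python
import math

def estimate_blocks(file_size_bytes: int, block_size_mb: int) -> int:
    """Estimate how many HDFS blocks a file of the given size occupies.

    Illustrative only: by default each block becomes one input split,
    and each split is processed by one map task.
    """
    block_size_bytes = block_size_mb * 1024 * 1024
    return math.ceil(file_size_bytes / block_size_bytes)

# A 1 GB file occupies 8 blocks at 128 MB, but 16 blocks at 64 MB,
# so the block size directly changes how the work is partitioned.
one_gb = 1024 * 1024 * 1024
print(estimate_blocks(one_gb, 128))  # 8
print(estimate_blocks(one_gb, 64))   # 16
```

Halving the block size doubles the number of partitions here, which is exactly why the cluster's block configuration matters before you ever write a line of job logic.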
Another consideration is where you're going to retrieve your data from in order to perform the MapReduce operations, or the parallel processing, on it. Although in core Hadoop you will work with the Hadoop file system (HDFS), it is possible to execute MapReduce algorithms against information stored in other locations, such as the native file system or cloud storage buckets such as Amazon S3 or Windows Azure Blob storage. That's an additional consideration; obviously, if you're going to use cloud storage, you also have to think about transport up and down, synchronization, and those kinds of considerations. Another consideration is that the output of a MapReduce job is immutable: your output is a one-time output, and when a new output is generated it goes to a new file name.
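Because job output is immutable (Hadoop will fail a job rather than overwrite an existing output directory), a common pattern is to stamp each run's output path with a unique run identifier. The helper and paths below are hypothetical, just to illustrate the pattern:

```python
from datetime import datetime, timezone

def make_output_path(base: str, job_name: str, run_id: str) -> str:
    """Build a per-run output path so repeated runs never collide.

    Hypothetical naming scheme: MapReduce treats output as write-once,
    so each run writes to a fresh directory instead of overwriting.
    """
    return f"{base}/{job_name}/{run_id}"

# Timestamp the run so each execution gets a distinct output location.
run_id = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
path = make_output_path("hdfs:///user/analytics/output", "wordcount", run_id)
```

The same idea applies whether the output lands in HDFS, S3, or a local file system: new run, new path.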
So, the final consideration in preparing for MapReduce is thinking about the logic that you'll be writing. You will have, of course, your data import for whatever particular business situation you're trying to address; then you will write logic in some programming language, library, or tool, or some combination of these, to map that data, then reduce it, and then produce some output. One of the concepts we want you to start thinking about is that when you're working with data coming through the MapReduce pipelines, whether it's a single job or multiple jobs, you're going to be working with key-value pairs.
Regardless of the format of the data coming in, as you will see, it is very commonly expressed as large text files. That is certainly not the only type of information you can bring in; many different formats are supported, virtually unlimited. But a text file is the simplest example to understand: the idea is a large text file with words separated by spaces and lines separated by line breaks. From these text files you want to output key-value pairs, and these can be based on the types that ship with MapReduce, or sometimes you have to create your own types. So, you have these four considerations when you're starting to plan and prepare for MapReduce: loading the files; where the data is coming from; where the output will go (again, it can go to HDFS, back into the cloud, into the file system, or any other location you want to put it); and the business problem you're trying to address with this framework, along with how you're going to write your code to address it and with what languages and tools.
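The key-value flow described above can be sketched in miniature. This is an in-memory word-count simulation for illustration only; a real Hadoop job would use the Java Mapper/Reducer API with Writable types such as Text and IntWritable, and the framework, not your code, performs the shuffle between phases:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line: str):
    # Mapper: emit a (word, 1) key-value pair for each space-separated word.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: collapse each key's list of values into a single count.
    return {word: sum(counts) for word, counts in groups.items()}

# Each input line stands in for a line of a large text file.
lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Every stage consumes and produces key-value pairs, which is the mental model to carry into single- and multi-job pipelines alike.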