Big Data and Hadoop’s role in analyzing Big Data – Part 1
Here we will talk about the challenges produced by Big Data and how Hadoop addresses them. Previously we saw that Hadoop is a framework of tools whose objective is to support the running of applications on Big Data. So why do we need Hadoop to run applications on Big Data, and how were such applications run before? In this article we'll talk about the challenges that Big Data creates and why something like Hadoop was needed.
Now, first let's talk about Big Data. We understand that Hadoop is a set of tools that supports running applications on Big Data, but what is Big Data itself? There is no single definition of Big Data, but here are some of its attributes. Big Data means large and growing files, created on an almost daily basis, and these files are measured in terabytes and petabytes; a terabyte is 10^12 bytes and a petabyte is 10^15 bytes. So we are talking about a very large amount of data. Another attribute of Big Data is that it is unstructured: it is not organized data sitting in a relational database, in neatly designed tables with columns where you know exactly what kind of value will go into each column. It is unstructured, and that creates a challenge.
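To give a sense of the scale these units imply, here is a small illustrative Python sketch using the decimal definitions quoted above (10^12 and 10^15 bytes):

```python
# Decimal (SI) sizes of the units mentioned above, in bytes.
TERABYTE = 10**12
PETABYTE = 10**15

# A petabyte is a thousand terabytes:
print(PETABYTE // TERABYTE)  # → 1000

# How many 1 GB files (10^9 bytes each) fit in one petabyte:
print(PETABYTE // 10**9)     # → 1000000
```

So a single petabyte already corresponds to a million gigabyte-sized files, which is why traditional single-machine processing struggles at this scale.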
Where is this data coming from? It is coming from users like you and me, applications like Facebook, systems like ticketing systems, sensors in factories, and so on. All of these, plus many other sources, are creating Big Data; in other words, they are creating large and growing files.
So we understand that Hadoop is a set of tools that supports running applications on Big Data. The keyword behind Hadoop is Big Data: Big Data creates challenges that Hadoop addresses. The challenges arise at three levels: a lot of data is coming in at very high speed; a big volume of data has been gathered and is growing exponentially; and the data comes in all sorts of variety, not organized, including audio, video, log files and so on.
These attributes of Big Data create the challenge when it comes to processing Big Data and writing applications on it. Looking deeper, velocity is the speed at which the data is coming in; for example, four hundred million tweets are posted on Twitter daily, and one million transactions are handled by Wal-Mart every hour. All of this adds up to a large amount of data, which is the second challenge: velocity, the speed at which the data arrives, is one challenge, and volume, the total amount of data that accumulates, is another. For example, 2.5 petabytes of data are created by Wal-Mart transactions in just an hour.
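As a back-of-the-envelope check on the velocity figures above (using only the numbers quoted in this article), a short Python sketch:

```python
# Figures quoted above in this article.
tweets_per_day = 400_000_000
transactions_per_hour = 1_000_000

# Velocity expressed per second.
tweets_per_second = tweets_per_day / (24 * 60 * 60)
transactions_per_second = transactions_per_hour / 3600

print(round(tweets_per_second))        # → 4630
print(round(transactions_per_second))  # → 278
```

Thousands of tweets and hundreds of transactions every single second: this is the sustained arrival rate that a Big Data system has to absorb continuously, not in occasional bursts.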
And then variety is yet another challenge. Even though this data is arriving at very high speed and creating large files, it might still be manageable if it were structured data. But it is not organized data like a relational database stored in tables; this data comes in all sorts of formats. Some items are files, some are videos, some are audios, and even among the files there is a great deal of difference: files do not follow one standard format, so some hold data in one format and some in yet another. Here are some examples of Big Data: videos, audios, images, photos, log files, click trails, text messages, emails, documents, books, transactions and public records.
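To make the variety challenge concrete, here is a toy Python sketch (not part of Hadoop; the file names are made up for illustration) that groups a batch of incoming items by format. A relational table would demand one fixed schema up front, whereas data like this arrives in many formats at once:

```python
from collections import defaultdict
from pathlib import PurePath

# Hypothetical sample of incoming items, mixing the kinds of
# Big Data listed above: logs, video, audio, images, documents.
incoming = ["clickstream.log", "lecture.mp4", "podcast.mp3",
            "scan.jpg", "report.pdf", "orders.csv", "mail.eml"]

# Group the items by their file extension.
by_format = defaultdict(list)
for name in incoming:
    by_format[PurePath(name).suffix].append(name)

for suffix, files in sorted(by_format.items()):
    print(suffix, files)
```

Even this tiny batch spans seven different formats, each needing its own parsing and processing logic; at petabyte scale, that heterogeneity is exactly what makes variety a challenge of its own.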