Understanding Big Data with examples
The data warehouse gave fairly accurate insight into the data and was a great help in providing a stable support system, but it had some shortcomings. The major disadvantage of the data warehouse was that the algorithms ran on a small sample of the data collected from various sources. If the size of the data sample were to increase, the server would take a very long time to process all the data and derive meaningful results from it. So in some business scenarios it was like looking into a room through a keyhole and trying to work out the size and shape of the room.
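To make the keyhole analogy concrete, here is a minimal Python sketch. The order values are made up purely for illustration, and the function name average_order_value is hypothetical; it only shows how a small sample can miss rare but important records that the full dataset contains.

```python
from statistics import mean


def average_order_value(orders):
    """Return the mean value of a list of order amounts."""
    return mean(orders)


# Hypothetical data: 990 small orders followed by 10 very large ones.
orders = [20] * 990 + [5000] * 10

# A 1% sample (every 100th order) happens to contain only small orders.
sample = orders[::100]

sample_avg = average_order_value(sample)  # what the "keyhole" shows
full_avg = average_order_value(orders)    # what the whole "room" looks like

print("orders in sample:", len(sample))
print("average from sample:", sample_avg)
print("average from all data:", full_avg)
```

In this made-up case the sample never sees a large order, so the estimate it produces is far below the average computed over all the data. Real sampling schemes are more careful than this one, but the risk of missing rare records remains whenever only a small slice of the data is analyzed.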
The second shortcoming was that the process of collecting data from various sources, cleaning and organizing it, and then running heavy analytics on it had a long turnaround time, so the results were stale by the time they arrived. It was like deciding to cross the road based on a picture taken 5 minutes ago. So basically, the problems started with the inability to process lots of data and to do it quickly. This large amount of data is known as Big Data, and the tools with which we can process and analyze Big Data are known as Big Data tools.
Hadoop is one of the Big Data tools; with Hadoop one can process a much larger amount of data at a much higher throughput and at a much lower price than with the existing data warehousing tools. So that is a very simple take on understanding Big Data and Hadoop. Now let us take a textbook definition and see if all this makes sense.
The textbook definition of Big Data is, “Big Data are a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”. Let us break it into parts to ensure that we understand it completely. Big Data are a collection of datasets: as we saw in the previous example, organizations collect data from a lot of data sources, so it is a collection. The datasets are large and complex: of course, since an organization would ideally like to consider as much data as possible so as to arrive at the most accurate results. And that is why it became difficult for the traditional “database management tools or traditional data processing applications” to process the data as its volume increased.
This example also brings out the three V attributes that are used to describe Big Data problems, namely Volume, Variety and Velocity. Volume reflects the large amount of data that needs to be processed; as the various datasets are stacked together, the amount of data increases. Variety reflects the different sources of data; it can range from web-server logs to structured data from databases to unstructured data from social media. And the third V, Velocity, indicates the rate at which data keeps accumulating over time. Although the three Vs are a good description of Big Data problems and may be found in your organization, they are only guidelines for recognizing Big Data problems! There can be Big Data scenarios where only two Vs apply, or even just one. Can you think of such scenarios?