BigData Hadoop Notes
Big Data usually includes data sets with sizes beyond the ability of commonly used software tools to manage and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single dataset.
It is the term for a collection of data sets, so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
Big data is the term which defines three characteristics
Already we have RDBMS to store and process structured data databases and tables in the form of row and columns. But of late we have been getting data in a form of videos, images and text. This data is called as unstructured data and semistructured data. This data can not be stored and can’t be the process by RDBMS.
So Definitely we have to find an alternative way to store and to process this type of unstructured and semistructured data.
For this reason, we HADOOP to store and to process large set of data. This HADOOP is entirely different from Traditional distributed file system. it can overcome all the problems exists in the traditional distributed systems.
The Motivation For Hadoop
• What problems exist with ‘traditional’ large-scale computing systems
• What requirements an alternative approach should have
• How Hadoop addresses those requirements
Problems with Traditional Large-Scale Systems
• Traditionally, computation has been processor-bound Relatively small amounts of data
• For decades, the primary push was to increase the computing power of a single machine
• Distributed systems evolved to allow developers to use multiple machines for a single job
Distributed Systems: Data Storage
• Typically, data for a distributed system is stored on a SAN
• At compute time, data is copied to the compute nodes
• Fine for relatively limited amounts of data
Distributed Systems: Problems
• Programming for traditional distributed systems is complex
• Data exchange requires synchronization
• Finite bandwidth is available
• Temporal dependencies are complicated
• It is difficult to deal with partial failures of the system
The Data-Driven World
Modern systems have to deal with far more data than was the case in the past
• Organizations are generating huge amounts of data
• That data has inherent value and cannot be discarded
• Examples: Facebook — over 70PB of data eBay — over 5PB of data
Many organizations are generating data at a rate of terabytes per day. Getting the data to the processors becomes the bottleneck
Requirements for a New Approach
Partial Failure Support
The system must support partial failure
• Failure of a component should result in a graceful degradation of application performance. Not complete failure of the entire system.
• If a component of the system fails, its workload should be assumed by still-functioning units in the system
• Failure should not result in the loss of any data
• If a component of the system fails and then recovers, it should be able to rejoin the system.
• Without requiring a full restart of the entire system
• Component failures during execution of a job should not affect the outcome of the job
• Adding the load to the system should result in a graceful decline in performance of individual jobs Not failure of the system
• Increasing resources should support a proportional increase in load capacity.
• Hadoop is based on work done by Google in the late 1990s/early 2000s.
• Specifically, on papers describing the Google File System (GFS).
• published in 2003, and MapReduce published in 2004.
• This work takes a radical new approach to the problems of Distributed computing so that it meets all the requirements of reliability and availability.
• This core concept is distributing the data as it is initially stored in the system.
• Individual nodes can work on data local to those nodes so data cannot be transmitted over the network.
• Developers need not to worry about network programming, temporal dependencies or low-level infrastructure.
• Nodes can talk to each other as little as possible. Developers should not write code which communicates with nodes.
• Data spread among the machines in advance so that computation happens where the data is stored, wherever possible.
• Data is replicated multiple times on the system for increasing availability and reliability.
• When data is loaded into the system, it splits the input file into ‘blocks ‘, typically 64MB or 128MB.
• Map tasks generally work on relatively small portions of data that is typically a single block.
• A master program allocates work to nodes such that a map task will work on a block of data stored locally on that node whenever possible.
• Nodes work in parallel to each of their own part of the dataset.
• If a node fails, the master will detect that failure and re-assigns the work to some other node in the system.
• Restarting a task does not require communication with nodes working on other portions of the data.
• If failed node restarts, it is automatically add back to the system and will be assigned to new tasks.
what is speculative execution?
If a node appears to be running slowly, the master node can redundantly execute another instance of the same task and a first output will be taken. This process is called as speculative execution.
Hadoop consists of two core components
There are many other projects based on core concepts of Hadoop. All these projects are called as Hadoop Ecosystem.
Hadoop Ecosystem has
Oozie and so on.
A set of machines running HDFS and MapReduce is known as Hadoop cluster and Individual machines are known as nodes.
A cluster can have as few as one node or as many as several thousands of nodes. So More nodes can give better performance.
HDFS is a file system designed for storing very large files with streaming data access patterns, running on cluster of commodity hardware.
Streaming data access in HDFS is built with an idea that the most data processing pattern is a Write Once And Read Many times pattern.
Low-latency data access pattern
Application that require low-latency access to data in that tens of milliseconds range but it will not work well with HDFS.
HDFS is optimized for delivering a high through put of data so the result will not come in seconds.
Data split into blocks and distributed across multiple nodes in the cluster.
Each block is typically 64MB or 128MB in size.
Each block is replicated multiple times. Default is replicated each block three times. Replicas are stored on different nodes this ensures both reliability and availability.
It provides redundant storage for massive amount of data using cheap and unreliable computers.
Files in HDFS are ‘write once’.
Different blocks from same file will be stored on different machines. This provides for efficient MapReduce processing.
There are five daemons in HDFS
NameNode keeps track of which blocks make up a file and where those blocks are located. These details of data are known as Metadata.
Example: NameNode holds metadata for the two files (Foo.txt, Bar.txt)
Datanode holds the actual’ data blocks
Each block Is of G4MB or 128M13 In size. Each block is replicated three times on the cluster.
The NameNode Daemon must be running at all times. If the NameNode gets stopped, the cluster becomes inaccessible.
System administrator will take care to ensure that the NameNode hardware is reliable.
A separate daemon known as SecondaryNameNode takes care of some housekeeping task for the NameNode. But this SecondaryNameNode is not a backup NameNode. But it is a backup for metadata of NameNode.
Although files are splitted into 64MB or 128MB blocks, if a file is smaller than this the full 64MB or 128MB will not be used.
Blocks are stored as standard files on Datallodes, in a set of directories specified in Hadoop configuration file.
Without metadata on the NameNode, there is no way to access the files in the HDFS.
When client application wants to read a file,
It communicates with the NameNode to determine which blocks make up the file and which Datallodes those blocks reside on.
It then communicates directly with the Datallodes to read the data.
• Not a hot standby for the Name Node
• Connects to Name Node every hour*
• Housekeeping, backup of Name Node metadata
• Saved metadata can rebuild a failed Name Node