
Hadoop Developer Course Content

This course covers 100% of the Developer syllabus and 40% of the Administration syllabus.

Introduction to Big Data and Hadoop:-

  • Big Data Introduction
  • Hadoop Introduction
  • What is Hadoop? Why Hadoop?
  • Hadoop History
  • Different types of Components in Hadoop
  • HDFS, MapReduce, PIG, Hive, SQOOP, HBASE, OOZIE, Flume, Zookeeper and so on…
  • What is the scope of Hadoop?

Deep Dive into HDFS (for Storing the Data):-

  • Introduction to HDFS
  • HDFS Design
  • HDFS role in Hadoop
  • Features of HDFS
  • Daemons of Hadoop and their functionality
  • Name Node
  • Secondary Name Node
  • Job Tracker
  • Data Node
  • Task Tracker
    • Anatomy of File Write
    • Anatomy of File Read
    • Network Topology
  • Nodes
  • Racks
  • Data Center
    • Parallel Copying using DistCp
    • Basic Configuration for HDFS
    • Data Organization
  • Blocks
  • Replication
    • Rack Awareness
    • Heartbeat Signal
    • How to Store the Data into HDFS
    • How to Read the Data from HDFS
    • Accessing HDFS (Introduction to Basic UNIX commands)
    • CLI commands (see the sketch after this list)
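
As a preview of the storage topics above, here is a minimal sketch of writing a file into HDFS and reading it back through Hadoop's Java FileSystem API. The path and file contents are hypothetical, and the sketch assumes fs.defaultFS is configured in core-site.xml on the classpath.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS from core-site.xml on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file into HDFS (hypothetical path)
            Path file = new Path("/user/training/sample.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("hello hdfs\n");
            }

            // Read the file back, line by line
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file)))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }

The same store and read operations are available from the command line through hadoop fs -put, hadoop fs -cat and hadoop fs -ls, which fall under the CLI commands topic.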

 

MapReduce using Java (Processing the Data):-

  • Introduction to MapReduce
  • MapReduce Architecture
  • Data flow in MapReduce
    • Splits
    • Mapper
    • Partitioning
    • Sort and shuffle
    • Combiner
    • Reducer
  • Understand Difference Between Block and InputSplit
  • Role of RecordReader
  • Basic Configuration of MapReduce
  • MapReduce life cycle
    • Driver Code
    • Mapper
    • Reducer
  • How MapReduce Works
  • Writing and Executing the Basic MapReduce Program using Java
  • Submission & Initialization of MapReduce Job.
  • File Input/Output Formats in MapReduce Jobs
    • Text Input Format
    • Key Value Input Format
    • Sequence File Input Format
    • NLine Input Format
  • Joins
    • Map-side Joins
    • Reducer-side Joins
  • Word Count Example (see the sketch after this section)
  • Partition MapReduce Program
  • Side Data Distribution
    • Distributed Cache (with Program)
  • Counters (with Program)
    • Types of Counters
    • Task Counters
    • Job Counters
    • User Defined Counters
    • Propagation of Counters
  • Job Scheduling
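
Below is a minimal sketch of the word count example referenced above, written with the org.apache.hadoop.mapreduce API covered in this module. The input and output paths are passed as command-line arguments and are hypothetical.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every word in the input line
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reducer: sums the counts emitted for each word
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        // Driver code: configures and submits the job
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A jar built from this class would typically be submitted with hadoop jar wordcount.jar WordCount followed by the input and output paths.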

PIG:-

  • Introduction to Apache PIG
  • Introduction to PIG Data Flow Engine
  • MapReduce vs. PIG in detail
  • When should PIG be used?
  • Data Types in PIG
  • Basic PIG programming
  • Modes of Execution in PIG
    • Local Mode and
    • MapReduce Mode
  • Execution Mechanisms
    • Grunt Shell
    • Script
    • Embedded
  • Operators/Transformations in PIG
  • PIG UDF’s with Program
  • Word Count Example in PIG (see the sketch after this list)
  • Difference between MapReduce and PIG
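
To illustrate the Embedded execution mechanism and the word count example listed above, here is a minimal sketch that runs a Pig Latin word count from Java through the PigServer class. It runs Pig in local mode; the input file name and output directory are hypothetical.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class EmbeddedPigWordCount {
        public static void main(String[] args) throws Exception {
            // Local mode; use ExecType.MAPREDUCE to run against a cluster
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Pig Latin word count, registered statement by statement
            pig.registerQuery("lines  = LOAD 'input.txt' AS (line:chararray);");
            pig.registerQuery("words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("grpd   = GROUP words BY word;");
            pig.registerQuery("counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;");

            // Write the result to a hypothetical output directory
            pig.store("counts", "wordcount_out");
            pig.shutdown();
        }
    }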

SQOOP:-

  • Introduction to SQOOP
  • Use of SQOOP
  • Connect to MySQL database
  • SQOOP commands
    • Import
    • Export
    • Eval
    • Codegen etc…
  • Joins in SQOOP
  • Export to MySQL
  • Export to HBase

HIVE:-

  • Introduction to HIVE
  • HIVE Meta Store
  • HIVE Architecture
  • Tables in HIVE
    • Managed Tables
    • External Tables
  • Hive Data Types
    • Primitive Types
    • Complex Types
  • Partition
  • Joins in HIVE
  • HIVE UDFs and UDAFs with Programs
  • Word Count Example (see the JDBC sketch after this list)
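
As a small illustration of managed versus external tables and the word count example listed above, here is a minimal sketch that talks to Hive over JDBC. It assumes a HiveServer2 instance on localhost:10000 and the Hive JDBC driver on the classpath; the table names and HDFS location are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveTableExamples {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();

            // Managed table: Hive owns both the metadata and the data
            stmt.execute("CREATE TABLE IF NOT EXISTS docs_managed (line STRING)");

            // External table: Hive tracks only metadata; data stays at the HDFS path
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS docs_external (line STRING) "
                    + "LOCATION '/user/training/docs'");

            // Word count expressed in HiveQL; executed as MapReduce under the hood
            ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) AS cnt "
                  + "FROM (SELECT EXPLODE(SPLIT(line, ' ')) AS word FROM docs_external) w "
                  + "GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            con.close();
        }
    }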

HBASE:-

  • Introduction to HBASE
  • Basic Configurations of HBASE
  • Fundamentals of HBase
  • What is NoSQL?
  • HBase Data Model
    • Table and Row
    • Column Family and Column Qualifier
    • Cell and its Versioning
  • Categories of NoSQL Databases
    • Key-Value Database
    • Document Database
    • Column Family Database
  • HBASE Architecture
    • HMaster
    • Region Servers
    • Regions
    • MemStore
    • Store
  • SQL vs. NOSQL
  • How HBase differs from RDBMS
  • HDFS vs. HBase
  • Client-side buffering or bulk uploads
  • HBase Designing Tables
  • HBase Operations (see the sketch after this list)
    • Get
    • Scan
    • Put
    • Delete
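
Here is a minimal sketch of the four operations listed above (Put, Get, Scan, Delete) using the HBase Java client API. It assumes a table named employee with a column family info has already been created, for example from the HBase shell; the row key and values are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseCrud {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("employee"))) {

                // Put: insert or update a row
                Put put = new Put(Bytes.toBytes("emp001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ravi"));
                table.put(put);

                // Get: read the row back
                Result row = table.get(new Get(Bytes.toBytes("emp001")));
                System.out.println(Bytes.toString(
                        row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

                // Scan: iterate over all rows in the table
                try (ResultScanner scanner = table.getScanner(new Scan())) {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }

                // Delete: remove the row
                table.delete(new Delete(Bytes.toBytes("emp001")));
            }
        }
    }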

MongoDB:-

  • What is MongoDB?
  • Where to Use?
  • Configuration on Windows
  • Inserting the data into MongoDB
  • Reading the MongoDB data (see the sketch after this list)
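
Below is a minimal sketch of inserting and reading data with the MongoDB Java (sync) driver. It assumes a mongod instance running locally on the default port 27017; the database and collection names are hypothetical.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    public class MongoQuickStart {
        public static void main(String[] args) {
            // Connect to a local mongod on the default port
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoDatabase db = client.getDatabase("training");
                MongoCollection<Document> students = db.getCollection("students");

                // Insert one document
                students.insertOne(new Document("name", "Ravi").append("course", "Hadoop"));

                // Read the data back
                for (Document doc : students.find()) {
                    System.out.println(doc.toJson());
                }
            }
        }
    }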

Cluster Setup:-

  • Downloading and installing Ubuntu 12.x
  • Installing Java
  • Installing Hadoop
  • Creating Cluster
  • Increasing and Decreasing the Cluster Size
  • Monitoring the Cluster Health
  • Starting and Stopping the Nodes

Zookeeper:-

  • Introduction to Zookeeper
  • Data Model
  • Operations (see the sketch after this list)
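
Here is a minimal sketch of the basic operations using ZooKeeper's Java client API. It assumes a ZooKeeper server on localhost:2181; the znode path and data are hypothetical.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkBasicOps {
        public static void main(String[] args) throws Exception {
            // Connect and wait until the session is established
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            String path = "/demo-config";   // hypothetical znode

            // Create a persistent znode with some data
            if (zk.exists(path, false) == null) {
                zk.create(path, "v1".getBytes(),
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // Read and update the data
            System.out.println(new String(zk.getData(path, false, null)));
            zk.setData(path, "v2".getBytes(), -1);   // -1 matches any version

            // Delete the znode and close the session
            zk.delete(path, -1);
            zk.close();
        }
    }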

OOZIE:-

  • Introduction to OOZIE
  • Use of OOZIE
  • Where to use?

Flume:-

  • Introduction to Flume
  • Uses of Flume
  • Flume Architecture
  • Flume Master
  • Flume Collectors
  • Flume Agents

Project Explanation with Architecture

Various modern tools that come under the Hadoop ecosystem

Here we shall briefly discuss the different components of the Hadoop ecosystem, starting with what the term means. HDFS and MapReduce are the core components of the Hadoop framework, on which Big Data is stored and processed in a distributed manner. The Hadoop ecosystem refers to the set of tools built around these core components that help in the storage and processing of Big Data. The ecosystem is also constantly growing: over time, more and more tools based on distributed technology are being integrated with the Hadoop framework, which substantially increases the range of possible use cases.

The first one that needs to be discussed here is Pig. Writing the equivalent processing as multiple MapReduce jobs in languages like Java or Python would take an enormous amount of time and effort; Pig is a relatively simple data flow language that expresses the same work in a handful of statements and so cuts down on development time and effort. It was designed primarily for analysts and data scientists who have limited time and programming skills.

Hive provides an SQL-like language that runs on top of MapReduce. Pig and Hive were developed in different places (Pig by Yahoo and Hive by Facebook) but with the same idea in mind: both were designed to help analysts and data scientists with limited programming skills process data. Both Pig and Hive sit above the MapReduce layer; code written in Pig or Hive is converted into MapReduce jobs, which then run over the data stored in HDFS.

To facilitate the movement of data into or out of Hadoop, the tools Flume and Sqoop were created. Sqoop helps move data between relational databases and Hadoop, while Flume is used to ingest data as it is generated by an external source. Then there are tools like Impala, which is used for low-latency queries. HBase is another tool that provides real-time, random access to data stored on HDFS, and there are many other tools that provide different functionalities as well.

The biggest problem with all these tools is that they have been developed independently and in parallel by various organizations. For example, Yahoo came up with Pig while Facebook came up with Hive, and both made their tools open source for everybody to use; as a result, there can be plenty of compatibility issues between tools and versions. This is where vendors such as Cloudera and Hortonworks come into the picture. They package all the open source components, add their own refinements, and release their own distributions in which all the ecosystem components are tested to work together. Their business model is to keep the products open source and free of charge, and to charge for services and support.
