Why Hadoop?
Large Volumes of Data: Hadoop can store and quickly process huge amounts of data of
every variety (structured, unstructured and semi-structured). With data volumes and
varieties constantly increasing, especially from social media and the Internet
of Things (IoT), that’s a key consideration.
Computing Power: Hadoop’s distributed computing model processes big data in parallel,
so work that would overwhelm one machine is spread across many. The more computing
nodes you use, the more processing power you have.
Fault Tolerance: Data and application processing are protected against hardware
failure. If a node goes down, jobs are automatically redirected to other nodes
to make sure the distributed computing does not fail. Multiple copies of all
data are stored automatically.
Flexibility: Unlike a traditional relational database, you don’t have to process
data before storing it. You can store as much data as you want and decide how
to use it later. That includes unstructured data like text, images and videos.
Low Cost: The open-source framework is free and uses commodity hardware to
store large quantities of data.
Scalability: You can easily grow your system to handle more data simply by
adding nodes. Little administration is required.
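The distributed computing model described under Computing Power can be pictured with a small single-process sketch. This is only an illustration of the map, shuffle and reduce idea in plain Python, not Hadoop's API: the function names and the `num_nodes` parameter are made up for the example, and the "nodes" are simulated by slicing a list.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: add up the counts collected for one word."""
    return key, sum(values)


def word_count(lines, num_nodes=3):
    """Count words by splitting the input across simulated nodes."""
    # Round-robin split: each slice stands in for the data held by one node.
    splits = [lines[i::num_nodes] for i in range(num_nodes)]
    emitted = []
    for split in splits:  # on a real cluster these loops would run in parallel
        for line in split:
            emitted.extend(map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


if __name__ == "__main__":
    sample = ["big data needs big clusters", "data nodes store data"]
    print(word_count(sample))
```

The result does not depend on how many simulated nodes the input is split across, which is the property that lets a real cluster add nodes without changing the program.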
· Introduction to Data and System
· Types of Data
· Traditional way of dealing with large data and its problems
· Types of Systems & Scaling
· What is Big Data
· Challenges in Big Data
· Challenges in Traditional Application
· New Requirements
· What is Hadoop? Why Hadoop?
· Brief history of Hadoop
· Features of Hadoop
· Hadoop and RDBMS
· Overview of the Hadoop ecosystem
Installation in detail
Creating Ubuntu image in VMware
Downloading Hadoop
Installing SSH
Configuring Hadoop, HDFS & MapReduce
Download, Installation & Configuration Hive
Download, Installation & Configuration Pig
Download, Installation & Configuration Sqoop
Configuring Hadoop in Different Modes
About Hadoop Distributed File System
File System – Concepts
Replication Factor
Version File
Safe mode
Namespace IDs
Purpose of Name Node
Purpose of Data Node
Purpose of Secondary Name Node
Purpose of Job Tracker
Purpose of Task Tracker
HDFS Shell Commands – copy, delete, create directories etc.
Reading and Writing in HDFS
Difference between Unix commands and HDFS commands
Hadoop Admin Commands
Hands-on exercise with Unix and HDFS commands
Read / Write in HDFS – Internal Process between Client, NameNode & DataNodes.
Accessing HDFS using Java
Various Ways of Accessing
Understanding HDFS Java classes and methods
Admin: 1. Commissioning / Decommissioning DataNode
2. Balancer
Replication Policy
Network Distance / Topology
Map Reduce Programming and Exercises
About MapReduce
Understanding block and input splits
MapReduce Data types
Understanding Writable
Data Flow in MapReduce
Understanding MapReduce problem on datasets
MapReduce and Functional Programming
Writing MapReduce
Understanding Mapper
Understanding Reducer
Understanding Driver
Usage of Combiner
Understanding Partitioner
Usage of Distributed Cache
Passing the parameters to mapper and reducer
Analysing the Results
Log files
Input Formats and Output Formats
Counters, Skipping Bad and unwanted Records
Writing Joins in MapReduce with 2 Input files. Join Types.
Execute MapReduce Job – Exercises on MapReduce.
Job Scheduling: Type of
Hive and scenarios
Hive concepts
Schema on Read vs Schema on Write
Hive architecture
Install and configure Hive on cluster
Meta Store – Purpose & Type of Configurations
Different types of tables in Hive
Joins in Hive
Hive Query Language
Hive Data Types
Data Loading into Hive
Hive Query Execution
Hive library functions
Hive UDF
Hive Limitations
Pig - Setup and sample use cases
Pig basics
Install and configure Pig on a cluster
Pig Library functions
Pig Vs Hive
Write sample Pig Latin
Modes of running Pig
Running in Grunt shell
Running as Java program
HBase concepts
HBase architecture
Region server architecture
File storage architecture
HBase basics
Column access
HBase use cases
Install and configure HBase on a multi node cluster
Create database, Develop and run sample applications
Access data stored in HBase using Java API
Install and configure Sqoop on cluster
Connecting to RDBMS
Installing MySQL
Import data from MySQL to
Export data to MySQL
Internal mechanism of
Introduction to Oozie
Oozie architecture
XML file specifications
Specifying Workflow
Control nodes
Oozie job coordinator
Introduction to Flume
Configuration and Setup
Flume Sink with example
Flume Source with example
Complex Flume architecture
Introduction to ZooKeeper
Challenges in distributed
ZooKeeper: Design Goals
Data Model and Hierarchical Namespace
Client APIs
YARN and an introduction to Hadoop 2.0
Hadoop 1.0 Limitations
MapReduce Limitations
History of Hadoop 2.0
HDFS 2: Architecture
HDFS 2: Quorum based storage
HDFS 2: High availability
HDFS 2: Federation
YARN Architecture
Classic vs YARN
YARN multitenancy
YARN Capacity Scheduler
Steps to set up the environment:
1) Set up VirtualBox (use VDI, VHD or VMDK)
VDI - VirtualBox Disk Image.
VHD - Virtual Hard Disk.
VMDK - Virtual Machine Disk.
VDI is the native format of VirtualBox.
VMDK is developed by and for VMware, but Sun xVM, QEMU, VirtualBox, SUSE Studio, and .NET DiscUtils also support it. (Because of that broad support, this format may be the most apt if you want virtualization software that runs well on Ubuntu.)
VHD is the native format of Microsoft Virtual PC. This is a format that is popular with Microsoft products.
2) Create a VDI, VHD or VMDK image.
3) Install Ubuntu or whichever operating system you prefer.
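Step 2 can also be done from the command line instead of the VirtualBox GUI. The sketch below only builds the argument list for VirtualBox's `VBoxManage createmedium disk` command and does not run it. The helper name `create_disk_command`, the image name `hadoop-ubuntu` and the 20 GB size are assumptions chosen for illustration, and older VirtualBox releases may name the subcommand `createhd` instead.

```python
import shlex

# File extensions conventionally used for each disk image format.
EXTENSIONS = {"VDI": ".vdi", "VHD": ".vhd", "VMDK": ".vmdk"}


def create_disk_command(name, size_mb, fmt="VDI"):
    """Return the VBoxManage argument list that would create a disk image.

    `name` is the image file name without extension, `size_mb` the
    capacity in megabytes, and `fmt` one of VDI, VHD or VMDK.
    """
    fmt = fmt.upper()
    if fmt not in EXTENSIONS:
        raise ValueError(f"unsupported disk format: {fmt}")
    return [
        "VBoxManage", "createmedium", "disk",
        "--filename", name + EXTENSIONS[fmt],
        "--size", str(size_mb),
        "--format", fmt,
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it by hand.
    print(shlex.join(create_disk_command("hadoop-ubuntu", 20480)))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without VirtualBox installed; passing the list to `subprocess.run` would be the next step once the values are checked.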
Useful links using Hortonworks
More info about Hortonworks sandbox environment
