Why Hadoop?
· Large Volumes of Data: Hadoop can store and quickly process huge amounts of data in any variety (structured, semi-structured, and unstructured). With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that’s a key consideration.
· Computing Power: Hadoop’s distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
· Fault Tolerance: Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes so that the distributed computation does not fail, and multiple copies of all data are stored automatically.
· Flexibility: Unlike a traditional relational database, you don’t have to process data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data such as text, images, and videos.
· Low Cost: The open-source framework is free and uses commodity hardware to store large quantities of data.
· Scalability: You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
· Introduction to Data and System
· Types of Data
· Traditional way of dealing with large data and its problems
· Types of Systems & Scaling
· What is Big Data
· Challenges in Big Data
· Challenges in Traditional Application
· New Requirements
· What is Hadoop? Why Hadoop?
· Brief history of Hadoop
· Features of Hadoop
· Hadoop and RDBMS
· Overview of the Hadoop Ecosystem
Installation in detail
· Creating an Ubuntu image in VMware
· Downloading Hadoop
· Installing SSH
· Configuring Hadoop, HDFS & MapReduce (see the sample configuration sketch after this list)
· Download, Installation & Configuration of Hive
· Download, Installation & Configuration of Pig
· Download, Installation & Configuration of Sqoop
· Configuring Hadoop in Different Modes
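
For reference, a minimal sketch of the two core configuration files for a single-node (pseudo-distributed) setup follows; the port 9000 and a replication factor of 1 are common conventions for a local practice cluster, not requirements. (On Hadoop 1.x the first property is named fs.default.name instead of fs.defaultFS.)

core-site.xml:

<configuration>
  <!-- Point the default file system at the local NameNode. -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <!-- A replication factor of 1 suits a single-node practice setup. -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>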
About Hadoop Distributed File System (HDFS)
· File System – Concepts
· Blocks
· Replication Factor
· Version File
· Safe Mode
· Namespace IDs
· Purpose of the NameNode
· Purpose of the DataNode
· Purpose of the Secondary NameNode
· Purpose of the JobTracker
· Purpose of the TaskTracker
· HDFS Shell Commands – copy, delete, create directories, etc. (see the command examples after this list)
· Reading and Writing in HDFS
· Differences between Unix Commands and HDFS Commands
· Hadoop Admin Commands
· Hands-on Exercises with Unix and HDFS Commands
· Read/Write in HDFS – Internal Process between Client, NameNode & DataNodes
· Accessing HDFS Using the Java API (see the sketch after this list)
· Various Ways of Accessing HDFS
· Understanding HDFS Java Classes and Methods
· Admin: 1. Commissioning/Decommissioning DataNodes 2. Balancer, Replication Policy 3. Network Distance/Topology Script
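
As a taste of the HDFS shell commands covered above, here are a few common ones; the paths and file names are placeholders.

# Create a directory, upload a local file, list, read, and delete it.
hadoop fs -mkdir -p /user/hadoop/demo
hadoop fs -put localfile.txt /user/hadoop/demo/
hadoop fs -ls /user/hadoop/demo
hadoop fs -cat /user/hadoop/demo/localfile.txt
hadoop fs -rm /user/hadoop/demo/localfile.txt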
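
And a minimal sketch of reading a file through the HDFS Java API; the class name HdfsRead and the path are made up for illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml on the classpath
    FileSystem fs = FileSystem.get(conf);

    // Open the file and stream it line by line to stdout.
    Path path = new Path("/user/hadoop/demo/localfile.txt");
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}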
MapReduce Programming and Exercises
· About MapReduce
· Understanding Blocks and Input Splits
· MapReduce Data Types
· Understanding Writable
· Data Flow in a MapReduce Application
· Understanding MapReduce Problems on Datasets
· MapReduce and Functional Programming
· Writing a MapReduce Application (see the WordCount sketch after this list)
· Understanding the Mapper Function
· Understanding the Reducer Function
· Understanding the Driver
· Usage of Combiner
· Understanding Partitioner
· Usage of Distributed Cache
· Passing Parameters to the Mapper and Reducer
· Analysing the Results
· Log Files
· Input Formats and Output Formats
· Counters, Skipping Bad and Unwanted Records
· Writing Joins in MapReduce with 2 Input Files; Join Types
· Execute MapReduce Job – Insights
· Exercises on MapReduce
· Job Scheduling: Types of Schedulers
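
To make the Mapper, Reducer, and Driver topics concrete, here is the classic WordCount example against the org.apache.hadoop.mapreduce API; the class names and the reuse of the reducer as a combiner follow the standard tutorial pattern.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word. Also used as the combiner below.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: wires the mapper, combiner, reducer, and I/O paths together.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}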
Hive and scenarios
· Hive Concepts
· Schema on Read vs. Schema on Write
· Hive Architecture
· Install and Configure Hive on a Cluster
· Metastore – Purpose & Types of Configuration
· Different Types of Tables in Hive
· Buckets
· Partitions
· Joins in Hive
· Hive Query Language (see the sample queries after this list)
· Hive Data Types
· Data Loading into Hive Tables
· Hive Query Execution
· Hive Library Functions
· Hive UDFs
· Hive Limitations
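
A small HiveQL sketch tying together tables, partitions, and data loading; the employees table, its columns, and the file path are invented for illustration.

-- Partitioned table over comma-delimited text files.
CREATE TABLE IF NOT EXISTS employees (
  id     INT,
  name   STRING,
  salary DOUBLE
)
PARTITIONED BY (dept STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load a local CSV file into one partition.
LOAD DATA LOCAL INPATH '/tmp/employees_sales.csv'
INTO TABLE employees PARTITION (dept = 'sales');

-- Average salary per department.
SELECT dept, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept;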
Pig - Setup and sample use cases
· Pig Basics
· Install and Configure Pig on a Cluster
· Pig Library Functions
· Pig vs. Hive
· Write Sample Pig Latin Scripts (see the sketch after this list)
· Modes of Running Pig
· Running in the Grunt Shell
· Running as a Java Program
· Pig UDFs
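
A sample Pig Latin script of the kind written in this module; the input path and field names are placeholders.

-- Load a comma-separated file of (name, age, city) records from HDFS.
people = LOAD '/user/hadoop/people.csv' USING PigStorage(',')
         AS (name:chararray, age:int, city:chararray);

-- Keep only adults, group them by city, and count each group.
adults  = FILTER people BY age >= 18;
by_city = GROUP adults BY city;
counts  = FOREACH by_city GENERATE group AS city, COUNT(adults) AS n;

-- Print to the console (use STORE to write back to HDFS instead).
DUMP counts;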
HBase
· HBase Concepts
· HBase Architecture
· Region Server Architecture
· File Storage Architecture
· HBase Basics
· Column Access
· Scans
· HBase Use Cases
· Install and Configure HBase on a Multi-Node Cluster
· Create a Database, Develop and Run Sample Applications
· Access Data Stored in HBase Using the Java API (see the sketch after this list)
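
A minimal sketch of accessing HBase from Java with the client API; the users table, the info column family, and the row contents are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickstart {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {

      // Write one cell: row key "row1", column info:name.
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);

      // Read it back by row key.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println("info:name = " + Bytes.toString(name));
    }
  }
}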
Sqoop
· Install and Configure Sqoop on a Cluster
· Connecting to an RDBMS
· Installing MySQL
· Import Data from MySQL to Hive
· Export Data to MySQL
· Internal Mechanism of Import/Export (see the command sketch after this list)
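
A sketch of typical Sqoop import/export invocations; the JDBC URL, credentials, and table names are placeholders.

# Import a MySQL table into Hive (creates the Hive table if needed).
sqoop import \
  --connect jdbc:mysql://localhost:3306/shopdb \
  --username sqoopuser --password '****' \
  --table orders \
  --hive-import \
  --num-mappers 4

# Export an HDFS directory back into a MySQL table.
sqoop export \
  --connect jdbc:mysql://localhost:3306/shopdb \
  --username sqoopuser --password '****' \
  --table order_summaries \
  --export-dir /user/hive/warehouse/order_summaries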
Oozie
· Introduction to Oozie
· Oozie Architecture
· XML File Specifications
· Specifying Workflows (see the workflow.xml sketch after this list)
· Control Nodes
· Oozie Job Coordinator
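
A skeletal workflow.xml sketch showing the start/action/kill/end control flow; the workflow name, action name, and the trivial shell action are placeholders for illustration.

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="say-hello"/>

  <!-- One action node; real workflows chain several of these. -->
  <action name="say-hello">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>echo</exec>
      <argument>Hello from Oozie</argument>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>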
Flume
· Introduction to Flume
· Configuration and Setup (see the sample agent configuration after this list)
· Flume Sinks with an Example
· Channels
· Flume Sources with an Example
· Complex Flume Architectures
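
A sample single-agent configuration of the kind built in this module (the classic netcat source -> memory channel -> logger sink); the agent name a1 and port 44444 follow the Flume user guide convention.

a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: listen for lines of text on a TCP port.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: log events to the console.
a1.sinks.k1.type = logger

# Wire the source and sink to the channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1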
ZooKeeper
· Introduction to ZooKeeper
· Challenges in Distributed Applications
· Coordination
· ZooKeeper: Design Goals
· Data Model and Hierarchical Namespace
· Client APIs (see the sketch after this list)
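
A minimal sketch of the ZooKeeper Java client API; the znode path /demo and the ensemble address are placeholders.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkQuickstart {
  public static void main(String[] args) throws Exception {
    // Connect to a local ensemble; the watcher just logs session events.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, (WatchedEvent e) ->
        System.out.println("event: " + e.getState()));

    // Create a persistent znode and read its data back.
    String path = zk.create("/demo", "hello".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    byte[] data = zk.getData(path, false, null);
    System.out.println(path + " = " + new String(data));

    zk.close();
  }
}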
YARN and an introduction to Hadoop 2.0
· Hadoop 1.0 Limitations
· MapReduce Limitations
· History of Hadoop 2.0
· HDFS 2: Architecture
· HDFS 2: Quorum-Based Storage
· HDFS 2: High Availability
· HDFS 2: Federation
· YARN Architecture
· Classic MapReduce vs. YARN
· YARN Apps
· YARN Multitenancy
· YARN Capacity Scheduler (see the configuration sketch after this list)
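
A sketch of how queues are declared in capacity-scheduler.xml; the prod/dev queue names and the 70/30 split are invented for illustration.

<configuration>
  <!-- Two child queues under the root queue. -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <!-- Guaranteed capacity shares, in percent of the cluster. -->
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
</configuration>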
Steps to set up the environment:
1) Set up VirtualBox (use VDI, VHD, or VMDK as the disk format).
VDI - VirtualBox Disk Image.
VHD - Virtual Hard Disk.
VMDK - Virtual Machine Disk.
VDI is the native format of VirtualBox.
VMDK was developed by and for VMware, but Sun xVM, QEMU, VirtualBox, SUSE Studio, and .NET DiscUtils also support it. (This format might be the most apt if you want virtualization software that runs fine on Ubuntu.)
VHD is the native format of Microsoft Virtual PC and is popular with Microsoft products.
2) Create the VDI, VHD, or VMDK image.
3) Install Ubuntu, or whichever operating system you prefer.
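
For those who prefer the command line, recent VirtualBox versions can create the same VM with VBoxManage; the VM name, disk size, and resource settings below are placeholders, and the GUI achieves the same result.

# Register a new 64-bit Ubuntu VM.
VBoxManage createvm --name hadoop-ubuntu --ostype Ubuntu_64 --register

# Create a 20 GB VDI disk and attach it via a SATA controller.
VBoxManage createmedium disk --filename hadoop-ubuntu.vdi --size 20480 --format VDI
VBoxManage storagectl hadoop-ubuntu --name SATA --add sata
VBoxManage storageattach hadoop-ubuntu --storagectl SATA --port 0 --device 0 \
  --type hdd --medium hadoop-ubuntu.vdi

# Give the VM enough memory and CPUs for a small Hadoop setup.
VBoxManage modifyvm hadoop-ubuntu --memory 4096 --cpus 2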
Useful links for Hortonworks:
https://community.hortonworks.com/kb/list.html
More info about the Hortonworks sandbox environment:
https://hortonworks.com/tutorial/getting-started-with-hdf-sandbox/