Monday, March 19, 2018

Environment - Big Data / Data Science



Why Hadoop?

·         Large Volumes of Data: Ability to store and process huge volumes of data in any variety (structured, semi-structured, and unstructured), quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that’s a key consideration.
·         Computing Power: Hadoop’s distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
·         Fault Tolerance: Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
·         Flexibility: Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later, including unstructured data such as text, images, and videos.
·         Low Cost: The open-source framework is free and uses commodity hardware to store large quantities of data.
·         Scalability: You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
·          Introduction to Data and Systems
·          Types of Data
·          Traditional ways of dealing with large data and their problems
·          Types of Systems & Scaling
·          What is Big Data
·          Challenges in Big Data
·          Challenges in Traditional Application
·          New Requirements
·          What is Hadoop? Why Hadoop?
·          Brief history of Hadoop
·          Features of Hadoop
·          Hadoop and RDBMS
·          Overview of the Hadoop Ecosystem
·          Installation in detail
·          Creating Ubuntu image in VMware
·          Downloading Hadoop
·          Installing SSH
·          Configuring Hadoop, HDFS & MapReduce
·          Download, Installation & Configuration Hive
·          Download, Installation & Configuration Pig
·          Download, Installation & Configuration Sqoop
·          Configuring Hadoop in Different Modes (a minimal pseudo-distributed config sketch follows this list)
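To make the configuration step concrete, here is a minimal sketch for pseudo-distributed mode, assuming a Hadoop 2.x install (host and port are illustrative; Hadoop 1.x uses fs.default.name instead of fs.defaultFS):

    <!-- core-site.xml: point clients at the local NameNode -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml: a single node can hold only one replica -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>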


About the Hadoop Distributed File System (HDFS)

·          File System – Concepts
·          Blocks
·          Replication Factor
·          Version File
·          Safe mode
·          Namespace IDs
·          Purpose of the NameNode
·          Purpose of the DataNode
·          Purpose of the Secondary NameNode
·          Purpose of the JobTracker
·          Purpose of the TaskTracker
·          HDFS Shell Commands – copy, delete, create directories, etc. (sample commands follow this list)
·          Reading and Writing in HDFS
·          Differences between Unix commands and HDFS commands
·          Hadoop Admin Commands
·          Hands on exercise with Unix and HDFS commands
·          Read / Write in HDFS – Internal Process between Client, NameNode & DataNodes.
·          Accessing HDFS using the Java API (a short sketch follows this list)
·          Various Ways of Accessing HDFS
·          Understanding HDFS Java classes and methods
·          Admin: 1. Commissioning / Decommissioning a DataNode
2.     Balancer & Replication Policy
3.     Network Distance / Topology Script
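A few representative HDFS shell and admin commands, assuming a Hadoop 2.x installation (paths and file names are invented for illustration):

    hadoop fs -mkdir -p /user/demo/input                # create a directory in HDFS
    hadoop fs -put localfile.txt /user/demo/input       # copy a local file into HDFS
    hadoop fs -ls /user/demo/input                      # list directory contents
    hadoop fs -cat /user/demo/input/localfile.txt       # print a file to stdout
    hadoop fs -rm -r /user/demo/input                   # delete a directory recursively
    hdfs dfsadmin -report                               # admin: DataNode and capacity report
    hdfs dfsadmin -safemode get                         # admin: check safe mode status

Note how the shell mirrors familiar Unix commands (mkdir, ls, cat, rm) while operating on the distributed file system rather than the local one.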
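And a minimal Java sketch of reading a file through the HDFS API (the class name and path are hypothetical; the client picks up core-site.xml from the classpath):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // loads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);          // client handle to the cluster
            Path file = new Path("/user/demo/input/localfile.txt");  // hypothetical path
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(file)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);              // stream the file line by line
                }
            }
            fs.close();
        }
    }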


MapReduce Programming and Exercises

·          About MapReduce
·          Understanding blocks and input splits
·          MapReduce Data types
·          Understanding Writable
·          Data Flow in MapReduce Application
·          Understanding MapReduce problem on datasets
·          MapReduce and Functional Programming
·          Writing a MapReduce Application (a complete WordCount sketch follows this list)
·          Understanding Mapper function
·          Understanding Reducer Function
·          Understanding Driver
·          Usage of Combiner
·          Understanding Partitioner
·          Usage of Distributed Cache
·          Passing the parameters to mapper and reducer
·          Analysing the Results
·          Log files
·          Input Formats and Output Formats
·          Counters; skipping bad and unwanted records
·          Writing joins in MapReduce with two input files; join types
·          Executing a MapReduce job – insights
·          Exercises on MapReduce
·          Job Scheduling: Types of Schedulers
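To make the Mapper / Reducer / Driver split concrete, here is the classic WordCount against the Hadoop 2.x (org.apache.hadoop.mapreduce) API; treat it as a sketch, not the course’s exact exercise:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in the input line
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sums the counts for each word; reusable as a Combiner
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        // Driver: wires mapper, combiner, reducer, and I/O paths together
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);   // combiner cuts shuffle traffic
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }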


Hive and scenarios

·          Hive concepts
·          Schema on Read vs. Schema on Write
·          Hive architecture
·          Install and configure Hive on a cluster
·          Meta Store – Purpose & Type of Configurations
·          Different types of tables in Hive
·          Buckets
·          Partitions
·          Joins in hive
·          Hive Query Language (a short HiveQL sketch follows this list)
·          Hive Data Types
·          Data Loading into Hive Tables
·          Hive Query Execution
·          Hive library functions
·          Hive UDF
·          Hive Limitations
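A short HiveQL sketch tying partitions, loading, and querying together (table, column, and file names are invented for illustration):

    -- Managed table, partitioned by country
    CREATE TABLE customers (
        id   INT,
        name STRING,
        city STRING
    )
    PARTITIONED BY (country STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- Load a local file into one partition (schema on read: no validation at load time)
    LOAD DATA LOCAL INPATH '/tmp/customers_in.csv'
    INTO TABLE customers PARTITION (country = 'IN');

    -- Query; Hive compiles this into MapReduce jobs behind the scenes
    SELECT country, COUNT(*) AS cnt
    FROM customers
    GROUP BY country;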

Pig - Setup and sample use cases

·          Pig basics
·          Install and configure Pig on a cluster
·          Pig library functions
·          Pig vs. Hive
·          Write sample Pig Latin scripts (a sample appears after this list)
·          Modes of running Pig
·          Running in the Grunt shell
·          Running as a Java program
·          Pig UDFs
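A sample Pig Latin script, here a word count over an HDFS directory (paths are illustrative); run it with "pig -x local script.pig" or "pig -x mapreduce script.pig":

    -- load raw lines, split into words, group, and count
    lines  = LOAD '/user/demo/input' USING TextLoader() AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
    STORE counts INTO '/user/demo/pig_out';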

HBase

·          HBase concepts
·          HBase architecture
·          Region server architecture
·          File storage architecture
·          HBase basics
·          Column access
·          Scans
·          HBase use cases
·          Install and configure HBase on a multi node cluster
·          Create database, Develop and run sample applications
·          Access data stored in HBase using the Java API (a short sketch follows this list)
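A minimal Java sketch of the HBase client API, assuming an HBase 1.x client and a pre-created 'users' table with a column family 'info' (all names are hypothetical):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseDemo {
        public static void main(String[] args) throws Exception {
            // Connection settings (ZooKeeper quorum etc.) come from hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Write one cell: row key, column family, qualifier, value
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ravi"));
                table.put(put);

                // Read it back with a Get on the same row
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println("name = " + Bytes.toString(name));
            }
        }
    }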

Sqoop

·          Install and configure Sqoop on a cluster
·          Connecting to an RDBMS
·          Installing MySQL
·          Import data from MySQL to Hive (sample commands follow this list)
·          Export data to MySQL
·          Internal mechanism of import/export
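Representative Sqoop commands, with database, user, and table names invented for illustration (-P prompts for the password):

    # Import a MySQL table into a Hive table
    sqoop import \
      --connect jdbc:mysql://localhost:3306/salesdb \
      --username sqoop_user -P \
      --table orders \
      --hive-import --hive-table orders \
      --num-mappers 4

    # Export HDFS data back into a MySQL table
    sqoop export \
      --connect jdbc:mysql://localhost:3306/salesdb \
      --username sqoop_user -P \
      --table orders_out \
      --export-dir /user/hive/warehouse/orders

Internally, each command is translated into a MapReduce job whose mappers read from (or write to) the database in parallel.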

Oozie

·          Introduction to Oozie
·          Oozie architecture
·          XML file specifications
·          Specifying a workflow (a minimal workflow.xml sketch follows this list)
·          Control nodes
·          Oozie Coordinator jobs
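A minimal workflow.xml sketch with a single MapReduce action (application name, paths, and property values are illustrative):

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
        <start to="wordcount"/>
        <action name="wordcount">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <property>
                        <name>mapred.input.dir</name>
                        <value>/user/demo/input</value>
                    </property>
                    <property>
                        <name>mapred.output.dir</name>
                        <value>/user/demo/output</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Job failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
        </kill>
        <end name="end"/>
    </workflow-app>

The start, kill, and end elements are control nodes; the action node does the actual work, and the ok/error transitions decide where control flows next.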

Flume

·          Introduction to Flume
·          Configuration and Setup (a sample agent configuration follows this list)
·          Flume Sink with example
·          Channel
·          Flume Source with example
·          Complex flume architecture
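A sample agent configuration wiring an exec source to an HDFS sink through a memory channel (agent and component names are made up); it could be started with: flume-ng agent --name demo --conf-file demo.conf

    # demo agent: tail a log file -> memory channel -> HDFS
    demo.sources  = src1
    demo.channels = ch1
    demo.sinks    = sink1

    # source: follow an application log (path is illustrative)
    demo.sources.src1.type = exec
    demo.sources.src1.command = tail -F /var/log/app/app.log
    demo.sources.src1.channels = ch1

    # channel: in-memory buffer between source and sink
    demo.channels.ch1.type = memory
    demo.channels.ch1.capacity = 10000

    # sink: write events into date-partitioned HDFS directories
    demo.sinks.sink1.type = hdfs
    demo.sinks.sink1.hdfs.path = /flume/events/%Y-%m-%d
    demo.sinks.sink1.hdfs.useLocalTimeStamp = true
    demo.sinks.sink1.channel = ch1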

ZooKeeper

·          Introduction to ZooKeeper
·          Challenges in distributed Applications
·          Coordination
·          ZooKeeper : Design Goals
·          Data Model and Hierarchical namespace
·          Client APIs (a short Java sketch follows this list)
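A short Java sketch of the client API against a local ZooKeeper (the znode path and data are hypothetical):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkDemo {
        public static void main(String[] args) throws Exception {
            // connect to a local ensemble; the watcher fires on session events
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000,
                    event -> System.out.println("event: " + event.getState()));

            // create a znode in the hierarchical namespace, then read it back
            String path = "/demo-config";
            if (zk.exists(path, false) == null) {
                zk.create(path, "v1".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
            byte[] data = zk.getData(path, false, null);
            System.out.println("data = " + new String(data));
            zk.close();
        }
    }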

YARN and an introduction to Hadoop 2.0

·          Hadoop 1.0 Limitations
·          MapReduce Limitations
·          History of Hadoop 2.0
·          HDFS 2: Architecture
·          HDFS 2: Quorum-based storage
·          HDFS 2: High availability
·          HDFS 2: Federation
·          YARN Architecture
·          Classic MapReduce vs. YARN
·          YARN Apps
·          YARN multitenancy
·          YARN Capacity Scheduler (a sample queue configuration follows this list)
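A sample capacity-scheduler.xml fragment splitting the cluster between two queues, which is how YARN multitenancy is typically expressed (queue names and percentages are illustrative):

    <configuration>
      <property>
        <name>yarn.scheduler.capacity.root.queues</name>
        <value>prod,dev</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.prod.capacity</name>
        <value>70</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.dev.capacity</name>
        <value>30</value>
      </property>
    </configuration>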








Steps to set up the environment:

1)  Set up VirtualBox (use VDI, VHD, or VMDK)

            VDI    -  VirtualBox Disk Image.
            VHD    -  Virtual Hard Disk.
            VMDK   -  Virtual Machine Disk.


              VDI is the native format of VirtualBox.
              VMDK was developed by and for VMware, but Sun xVM, QEMU, VirtualBox, SUSE Studio, and .NET DiscUtils also support it. (This format may be the best fit if you want virtualization software that runs well on Ubuntu.)
              VHD is the native format of Microsoft Virtual PC and is popular with Microsoft products.

2)  Create a VDI, VHD, or VMDK image (see the VBoxManage sketch after these steps).

3)  Install Ubuntu (or another operating system of your choice).
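For those who prefer the command line over the VirtualBox GUI, steps 1 and 2 might look like this (VM name, disk size, and memory are illustrative; createmedium needs VirtualBox 5.x or later, older releases use createhd):

    # register a new 64-bit Ubuntu VM
    VBoxManage createvm --name hadoop-ubuntu --ostype Ubuntu_64 --register

    # create a 40 GB VDI disk and attach it over SATA
    VBoxManage createmedium disk --filename hadoop-ubuntu.vdi --size 40960 --format VDI
    VBoxManage storagectl hadoop-ubuntu --name SATA --add sata
    VBoxManage storageattach hadoop-ubuntu --storagectl SATA --port 0 --device 0 \
        --type hdd --medium hadoop-ubuntu.vdi

    # give the VM enough resources for a single-node Hadoop setup
    VBoxManage modifyvm hadoop-ubuntu --memory 4096 --cpus 2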


Useful links for Hortonworks

                 https://community.hortonworks.com/kb/list.html 

More info about the Hortonworks sandbox environment:

                     https://hortonworks.com/tutorial/getting-started-with-hdf-sandbox/
