Saturday, September 22, 2018

Big Data Admin


I. Introduction to Hadoop—Architecture and Hadoop Clusters :

1. Introduction to Hadoop and Its Environment :

  • Hadoop—An Introduction
  • Unique Features of Hadoop
  • Big Data and Hadoop
  • A Typical Scenario for Using Hadoop
  • Traditional Database Systems
  • Data Lake
  • Big Data, Data Science and Hadoop
  • Cluster Computing and Hadoop Clusters
  • Cluster Computing
  • Hadoop Clusters
  • Hadoop Components and the Hadoop Ecosphere
  • What Do Hadoop Administrators Do?
  • Hadoop Administration—A New Paradigm
  • What You Need to Know to Administer Hadoop
  • The Hadoop Administrator’s Toolset
  • Key Differences between Hadoop 1 and Hadoop 2
  • Architectural Differences
  • High-Availability Features
  • Multiple Processing Engines
  • Separation of Processing and Scheduling
  • Resource Allocation in Hadoop 1 and Hadoop 2
  • Distributed Data Processing: MapReduce and Spark, Hive and Pig
  • MapReduce
  • Apache Spark
  • Apache Hive
  • Apache Pig
  • Data Integration: Apache Sqoop, Apache Flume and Apache Kafka
  • Key Areas of Hadoop Administration
  • Managing the Cluster Storage
  • Allocating the Cluster Resources
  • Scheduling Jobs
  • Securing Hadoop Data

2. An Introduction to the Architecture of Hadoop :

  • Distributed Computing and Hadoop
  • Hadoop Architecture
  • A Hadoop Cluster
  • Master and Worker Nodes
  • Hadoop Services
  • Data Storage—The Hadoop Distributed File System
  • HDFS Unique Features
  • HDFS Architecture
  • The HDFS File System
  • NameNode Operations
  • Data Processing with YARN, the Hadoop Operating System
  • Architecture of YARN
  • How the ApplicationMaster Works with the ResourceManager to Allocate Resources
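
A quick way to see these services in action: the JDK's jps utility lists the Hadoop daemons running on a node. On a healthy single-node setup you would expect output roughly like this (process IDs are illustrative):

    $ jps
    2481 NameNode
    2590 DataNode
    2674 SecondaryNameNode
    2855 ResourceManager
    2960 NodeManager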


3. Creating and Configuring a Simple Hadoop Cluster :
  • Hadoop Distributions and Installation Types
  • Hadoop Distributions
  • Hadoop Installation Types
  • Setting Up a Pseudo-Distributed Hadoop Cluster
  • Meeting the Operating System Requirements
  • Modifying Kernel Parameters
  • Setting Up SSH
  • Java Requirements
  • Installing the Hadoop Software
  • Creating the Necessary Hadoop Users
  • Creating the Necessary Directories
  • Performing the Initial Hadoop Configuration
  • Environment Configuration Files
  • Read-Only Default Configuration Files
  • Site-Specific Configuration Files
  • Other Hadoop-Related Configuration Files
  • Precedence among the Configuration Files
  • Variable Expansion and Configuration Parameters
  • Configuring the Hadoop Daemons Environment
  • Configuring Core Hadoop Properties (with the core-site.xml File)
  • Configuring MapReduce (with the mapred-site.xml File)
  • Configuring YARN (with the yarn-site.xml File)
  • Operating the New Hadoop Cluster
  • Formatting the Distributed File System
  • Setting the Environment Variables
  • Starting the HDFS and YARN Services
  • Verifying the Service Startup
  • Shutting Down the Services
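
For a flavor of the chapter's final steps, here is a minimal sketch of formatting and operating a pseudo-distributed cluster, assuming HADOOP_HOME points at an unpacked Hadoop 2.x distribution and the configuration files have already been edited:

    $HADOOP_HOME/bin/hdfs namenode -format   # one-time format of the new file system
    $HADOOP_HOME/sbin/start-dfs.sh           # start NameNode, DataNode and SecondaryNameNode
    $HADOOP_HOME/sbin/start-yarn.sh          # start ResourceManager and NodeManager
    jps                                      # verify that the daemons came up
    $HADOOP_HOME/sbin/stop-yarn.sh           # shut the services down in reverse order
    $HADOOP_HOME/sbin/stop-dfs.sh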

4. Planning for and Creating a Fully Distributed Cluster :
  • Planning Your Hadoop Cluster
  • General Cluster Planning Considerations
  • Server Form Factors
  • Criteria for Choosing the Nodes
  • Going from a Single Rack to Multiple Racks
  • Sizing a Hadoop Cluster
  • General Principles Governing the Choice of CPU, Memory and Storage
  • Special Treatment for the Master Nodes
  • Recommendations for Sizing the Servers
  • Growing a Cluster
  • Guidelines for Large Clusters
  • Creating a Multinode Cluster
  • How the Test Cluster Is Set Up
  • Modifying the Hadoop Configuration
  • Changing the HDFS Configuration (hdfs-site.xml file)
  • Changing the YARN Configuration
  • Changing the MapReduce Configuration
  • Starting Up the Cluster
  • Starting Up and Shutting Down the Cluster with Scripts
  • Performing a Quick Check of the New Cluster’s File System
  • Configuring Hadoop Services, Web Interfaces and Ports
  • Service Configuration and Web Interfaces
  • Setting Port Numbers for Hadoop Services
  • Hadoop Clients
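
Once a multinode cluster is up, a quick health check might look like the following sketch (in Hadoop 2 the NameNode web UI defaults to port 50070 and the ResourceManager UI to port 8088):

    hdfs dfsadmin -report            # configured/used capacity and live DataNodes
    hdfs dfs -mkdir -p /user/test    # exercise the file system end to end
    hdfs dfs -put /etc/hosts /user/test
    hdfs dfs -ls /user/test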

II. Hadoop Application Frameworks :


5. Running Applications in a Cluster—The MapReduce Framework (and Hive and Pig) :
  • The MapReduce Framework
  • The MapReduce Model
  • How MapReduce Works
  • MapReduce Job Processing
  • A Simple MapReduce Program
  • Understanding Hadoop’s Job Processing—Running a WordCount Program
  • MapReduce Input and Output Directories
  • How Hadoop Shows You the Job Details
  • Hadoop Streaming
  • Apache Hive
  • Hive Data Organization
  • Working with Hive Tables
  • Loading Data into Hive
  • Querying with Hive
  • Apache Pig
  • Pig Execution Modes
  • A Simple Pig Example
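
As an illustrative run of the WordCount example this chapter walks through (the examples jar ships with Hadoop; match the wildcarded version to your release, and words.txt is a placeholder input file):

    hdfs dfs -mkdir -p /user/test/input
    hdfs dfs -put ./words.txt /user/test/input
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        wordcount /user/test/input /user/test/output
    hdfs dfs -cat /user/test/output/part-r-00000      # view the word counts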

6. Running Applications in a Cluster—The Spark Framework :
  • What Is Spark?
  • Why Spark?
  • Speed
  • Ease of Use and Accessibility
  • General-Purpose Framework
  • Spark and Hadoop
  • The Spark Stack
  • Installing Spark
  • Spark Examples
  • Key Spark Files and Directories
  • Compiling the Spark Binaries
  • Reducing Spark’s Verbosity
  • Spark Run Modes
  • Local Mode
  • Cluster Mode
  • Understanding the Cluster Managers
  • The Standalone Cluster Manager
  • Spark on Apache Mesos
  • Spark on YARN
  • How YARN and Spark Work Together
  • Setting Up Spark on a Hadoop Cluster
  • Spark and Data Access
  • Loading Data from the Linux File System
  • Loading Data from HDFS
  • Loading Data from a Relational Database
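
A minimal sketch of starting Spark against a Hadoop cluster, assuming SPARK_HOME and HADOOP_CONF_DIR are set:

    $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client
    # inside the shell, HDFS data is then one call away, e.g.:
    #   val lines = sc.textFile("hdfs:///user/test/input/words.txt")
    #   lines.count()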

7. Running Spark Applications :
  • The Spark Programming Model
  • Spark Programming and RDDs
  • Programming Spark
  • Spark Applications
  • Basics of RDDs
  • Creating an RDD
  • RDD Operations
  • RDD Persistence
  • Architecture of a Spark Application
  • Spark Terminology
  • Components of a Spark Application
  • Running Spark Applications Interactively
  • Spark Shell and Spark Applications
  • A Bit about the Spark Shell
  • Using the Spark Shell
  • Overview of Spark Cluster Execution
  • Creating and Submitting Spark Applications
  • Building the Spark Application
  • Running an Application in the Standalone Spark Cluster
  • Using spark-submit to Execute Applications
  • Running Spark Applications on Mesos
  • Running Spark Applications in a YARN-Managed Hadoop Cluster
  • Using the JDBC/ODBC Server
  • Configuring Spark Applications
  • Spark Configuration Properties
  • Specifying Configuration when Running spark-submit
  • Monitoring Spark Applications
  • Handling Streaming Data with Spark Streaming
  • How Spark Streaming Works
  • A Spark Streaming Example—WordCount Again!
  • Using Spark SQL for Handling Structured Data
  • DataFrames
  • HiveContext and SQLContext
  • Working with Spark SQL
  • Creating DataFrames
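
An illustrative spark-submit invocation tying several of these topics together (the class name and jar path are placeholders):

    $SPARK_HOME/bin/spark-submit \
        --master yarn --deploy-mode cluster \
        --num-executors 4 --executor-memory 2g \
        --conf spark.eventLog.enabled=true \
        --class com.example.WordCount \
        /tmp/wordcount.jar hdfs:///user/test/input hdfs:///user/test/output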

III. Managing and Protecting Hadoop Data and High Availability :

8. The Role of the NameNode and How HDFS Works :

  • HDFS—The Interaction between the NameNode and the DataNodes
  • Interaction between the Clients and HDFS
  • NameNode and DataNode Communications
  • Rack Awareness and Topology
  • How to Configure Rack Awareness in Your Cluster
  • Finding Your Cluster’s Rack Information
  • HDFS Data Replication
  • HDFS Data Organization and Data Blocks
  • Data Replication
  • Block and Replica States
  • How Clients Read and Write HDFS Data
  • How Clients Read HDFS Data
  • How Clients Write Data to HDFS
  • Understanding HDFS Recovery Processes
  • Generation Stamp
  • Lease Recovery
  • Block Recovery
  • Pipeline Recovery
  • Centralized Cache Management in HDFS
  • Hadoop and OS Page Caching
  • The Key Principles Behind Centralized Cache Management
  • How Centralized Cache Management Works
  • Configuring Caching
  • Cache Directives
  • Cache Pools
  • Using the Cache
  • Hadoop Archival Storage, SSD and Memory (Heterogeneous Storage)
  • Performance Characteristics of Storage Types
  • Changes in the Storage Architecture
  • Storage Preferences for Files
  • Setting Up Archival Storage
  • Managing Storage Policies
  • Moving Data Around
  • Implementing Archival Storage
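
Centralized caching and heterogeneous storage are both driven from the command line; a hedged sketch with illustrative paths and pool names:

    hdfs cacheadmin -addPool hotdata                            # create a cache pool
    hdfs cacheadmin -addDirective -path /user/test/hot -pool hotdata
    hdfs cacheadmin -listDirectives
    hdfs storagepolicies -setStoragePolicy -path /archive -policy COLD
    hdfs mover -p /archive                                      # move blocks to match the policy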

9. HDFS Commands, HDFS Permissions and HDFS Storage :

  • Managing HDFS through the HDFS Shell Commands
  • Using the hdfs dfs Utility to Manage HDFS
  • Listing HDFS Files and Directories
  • Creating an HDFS Directory
  • Removing HDFS Files and Directories
  • Changing File and Directory Ownership and Groups
  • Using the dfsadmin Utility to Perform HDFS Operations
  • The dfsadmin -report Command
  • Managing HDFS Permissions and Users
  • HDFS File Permissions
  • HDFS Users and Super Users
  • Managing HDFS Storage
  • Checking HDFS Disk Usage
  • Allocating HDFS Space Quotas
  • Rebalancing HDFS Data
  • Reasons for HDFS Data Imbalance
  • Running the Balancer Tool to Balance HDFS Data
  • Using hdfs dfsadmin to Make Things Easier
  • When to Run the Balancer
  • Reclaiming HDFS Space
  • Removing Files and Directories
  • Decreasing the Replication Factor
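
A sampler of the commands this chapter covers (user names, paths and quota sizes are illustrative):

    hdfs dfs -ls /user                          # list files and directories
    hdfs dfs -mkdir /user/sales                 # create a directory
    hdfs dfs -chown sam:hadoop /user/sales      # change ownership
    hdfs dfs -rm -r -skipTrash /user/sales/tmp  # remove, bypassing the trash
    hdfs dfs -du -h /user                       # check disk usage
    hdfs dfsadmin -setSpaceQuota 10t /user/sales
    hdfs balancer -threshold 10                 # rebalance to within 10% of average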

10. Data Protection, File Formats and Accessing HDFS :

  • Safeguarding Data
  • Using HDFS Trash to Prevent Accidental Data Deletion
  • Using HDFS Snapshots to Protect Important Data
  • Ensuring Data Integrity with File System Checks
  • Data Compression
  • Common Compression Formats
  • Evaluating the Various Compression Schemes
  • Compression at Various Stages for MapReduce
  • Compression for Spark
  • Data Serialization
  • Hadoop File Formats
  • Criteria for Determining the Right File Format
  • File Formats Supported by Hadoop
  • The Ideal File Format
  • The Hadoop Small Files Problem and Merging Files
  • Using a Federated NameNode to Overcome the Small Files Problem
  • Using Hadoop Archives to Manage Many Small Files
  • Handling the Performance Impact of Small Files
  • Using Hadoop WebHDFS and HttpFS
  • WebHDFS—The Hadoop REST API
  • Using the WebHDFS API
  • Understanding the WebHDFS Commands
  • Using HttpFS Gateway to Access HDFS from Behind a Firewall
  • Summary
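
For instance, snapshots, file system checks and WebHDFS all come down to short commands like these (host name and paths are illustrative):

    hdfs dfsadmin -allowSnapshot /user/important
    hdfs dfs -createSnapshot /user/important before-cleanup
    hdfs fsck /user/important -files -blocks        # verify data integrity
    curl -i "http://nn.example.com:50070/webhdfs/v1/user/important?op=LISTSTATUS"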

11. NameNode Operations, High Availability and Federation :
  • Understanding NameNode Operations
  • HDFS Metadata
  • The NameNode Startup Process
  • How the NameNode and the DataNodes Work Together
  • The Checkpointing Process
  • Secondary, Checkpoint, Backup and Standby Nodes
  • Configuring the Checkpointing Frequency
  • Managing Checkpoint Performance
  • The Mechanics of Checkpointing
  • NameNode Safe Mode Operations
  • Automatic Safe Mode Operations
  • Placing the NameNode in Safe Mode
  • How the NameNode Transitions Through Safe Mode
  • Backing Up and Recovering the NameNode Metadata
  • Configuring HDFS High Availability
  • NameNode HA Architecture (QJM)
  • Setting Up an HDFS HA Quorum Cluster
  • Deploying the High-Availability NameNodes
  • Managing an HA NameNode Setup
  • HA Manual and Automatic Failover
  • HDFS Federation
  • Architecture of a Federated NameNode
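
A hedged sketch of the administrative side of this chapter, assuming NameNodes configured with the IDs nn1 and nn2:

    hdfs haadmin -getServiceState nn1      # is nn1 active or standby?
    hdfs haadmin -failover nn1 nn2         # manual failover
    hdfs dfsadmin -safemode enter          # required before saving the namespace
    hdfs dfsadmin -saveNamespace           # checkpoint the metadata
    hdfs dfsadmin -safemode leave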

IV. Moving Data, Allocating Resources, Scheduling Jobs and Security :

12. Moving Data Into and Out of Hadoop :
  • Introduction to Hadoop Data Transfer Tools
  • Loading Data into HDFS from the Command Line
  • Using the -cat Command to Dump a File’s Contents
  • Testing HDFS Files
  • Copying and Moving Files from and to HDFS
  • Using the -get Command to Move Files
  • Moving Files from and to HDFS
  • Using the -tail and -head Commands
  • Copying HDFS Data between Clusters with DistCp
  • How to Use the DistCp Command to Move Data
  • DistCp Options
  • Ingesting Data from Relational Databases with Sqoop
  • Sqoop Architecture
  • Deploying Sqoop
  • Using Sqoop to Move Data
  • Importing Data with Sqoop
  • Importing Data into Hive
  • Exporting Data with Sqoop
  • Ingesting Data from External Sources with Flume
  • Flume Architecture in a Nutshell
  • Configuring the Flume Agent
  • A Simple Flume Example
  • Using Flume to Move Data to HDFS
  • A More Complex Flume Example
  • Ingesting Data with Kafka
  • Benefits Offered by Kafka
  • How Kafka Works
  • Setting Up an Apache Kafka Cluster
  • Integrating Kafka with Hadoop and Storm
  • Summary
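
Two representative transfers from this chapter, with illustrative host names, database and credentials:

    hadoop distcp hdfs://nn1.example.com:8020/data hdfs://nn2.example.com:8020/data
    sqoop import --connect jdbc:mysql://db.example.com/sales \
        --username sam -P --table orders \
        --target-dir /user/sam/orders --num-mappers 4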

13. Resource Allocation in a Hadoop Cluster :
  • Resource Allocation in Hadoop
  • Managing Cluster Workloads
  • Hadoop’s Resource Schedulers
  • The FIFO Scheduler
  • The Capacity Scheduler
  • Queues and Subqueues
  • How the Cluster Allocates Resources
  • Preempting Applications
  • Enabling the Capacity Scheduler
  • A Typical Capacity Scheduler
  • The Fair Scheduler
  • Queues
  • Configuring the Fair Scheduler
  • How Jobs Are Placed into Queues
  • Application Preemption in the Fair Scheduler
  • Security and Resource Pools
  • A Sample fair-scheduler.xml File
  • Submitting Jobs to the Scheduler
  • Moving Applications between Queues
  • Monitoring the Fair Scheduler
  • Comparing the Capacity Scheduler and the Fair Scheduler
  • Similarities between the Two Schedulers
  • Differences between the Two Schedulers
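
As a small illustration, the scheduler is selected in yarn-site.xml and queue shares live in capacity-scheduler.xml; a hypothetical 70/30 split between a prod and a dev queue would use properties like these, reloaded without a restart:

    # yarn.resourcemanager.scheduler.class =
    #     org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
    # yarn.scheduler.capacity.root.queues        = prod,dev
    # yarn.scheduler.capacity.root.prod.capacity = 70
    # yarn.scheduler.capacity.root.dev.capacity  = 30
    yarn rmadmin -refreshQueues        # re-read the queue definitions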

14. Working with Oozie to Manage Job Workflows :
  • Using Apache Oozie to Schedule Jobs
  • Oozie Architecture
  • The Oozie Server
  • The Oozie Client
  • The Oozie Database
  • Deploying Oozie in Your Cluster
  • Installing and Configuring Oozie
  • Configuring Hadoop for Oozie
  • Understanding Oozie Workflows
  • Workflows, Control Flow, and Nodes
  • Defining the Workflows with the workflow.xml File
  • How Oozie Runs an Action
  • Configuring the Action Nodes
  • Creating an Oozie Workflow
  • Configuring the Control Nodes
  • Configuring the Job
  • Running an Oozie Workflow Job
  • Specifying the Job Properties
  • Deploying Oozie Jobs
  • Creating Dynamic Workflows
  • Oozie Coordinators
  • Time-Based Coordinators
  • Data-Based Coordinators
  • Time-and-Data-Based Coordinators
  • Submitting the Oozie Coordinator from the Command Line
  • Managing and Administering Oozie
  • Common Oozie Commands and How to Run Them
  • Troubleshooting Oozie
  • Oozie cron Scheduling and Oozie Service Level Agreements
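
The Oozie client talks to the server over HTTP (port 11000 by default); typical calls look like this sketch, with an illustrative URL and job ID:

    oozie job -oozie http://oozie.example.com:11000/oozie -config job.properties -run
    oozie job -oozie http://oozie.example.com:11000/oozie -info 0000001-180922-oozie-W
    oozie jobs -oozie http://oozie.example.com:11000/oozie -jobtype wf -len 10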

15. Securing Hadoop :
  • Hadoop Security—An Overview
  • Authentication, Authorization and Accounting
  • Hadoop Authentication with Kerberos
  • Kerberos and How It Works
  • The Kerberos Authentication Process
  • Kerberos Trusts
  • A Special Principal
  • Adding Kerberos Authorization to Your Cluster
  • Setting Up Kerberos for Hadoop
  • Securing a Hadoop Cluster with Kerberos
  • How Kerberos Authenticates Users and Services
  • Managing a Kerberized Hadoop Cluster
  • Hadoop Authorization
  • HDFS Permissions
  • Service Level Authorization
  • Role-Based Authorization with Apache Sentry
  • Auditing Hadoop
  • Auditing HDFS Operations
  • Auditing YARN Operations
  • Securing Hadoop Data
  • HDFS Transparent Encryption
  • Encrypting Data in Transit
  • Other Hadoop-Related Security Initiatives
  • Securing a Hadoop Infrastructure with Apache Knox Gateway
  • Apache Ranger for Security Administration
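
On a Kerberized cluster the daily rhythm changes only slightly for users, and encryption zones take a few commands to set up; a hedged sketch with illustrative principal, key and path names:

    kinit sam@EXAMPLE.COM            # obtain a ticket from the KDC
    klist                            # confirm it
    hdfs dfs -ls /user/sam           # Hadoop commands then work as usual
    hadoop key create mykey          # key for HDFS transparent encryption
    hdfs dfs -mkdir /secure          # zone directory must exist and be empty
    hdfs crypto -createZone -keyName mykey -path /secure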

V. Monitoring, Optimization and Troubleshooting :

16. Managing Jobs, Using Hue and Performing Routine Tasks :
  • Using the YARN Commands to Manage Hadoop Jobs
  • Viewing YARN Applications
  • Checking the Status of an Application
  • Killing a Running Application
  • Checking the Status of the Nodes
  • Checking YARN Queues
  • Getting the Application Logs
  • YARN Administrative Commands
  • Decommissioning and Recommissioning Nodes
  • Including and Excluding Hosts
  • Decommissioning DataNodes and NodeManagers
  • Recommissioning Nodes
  • Things to Remember about Decommissioning and Recommissioning
  • Adding a New DataNode and/or a NodeManager
  • ResourceManager High Availability
  • ResourceManager High-Availability Architecture
  • Setting Up ResourceManager High Availability
  • ResourceManager Failover
  • Using the ResourceManager High-Availability Commands
  • Performing Common Management Tasks
  • Moving the NameNode to a Different Host
  • Managing High-Availability NameNodes
  • Using a Shutdown/Startup Script to Manage Your Cluster
  • Balancing HDFS
  • Balancing the Storage on the DataNodes
  • Managing the MySQL Database
  • Configuring a MySQL Database
  • Configuring MySQL High Availability
  • Backing Up Important Cluster Data
  • Backing Up HDFS Metadata
  • Backing Up the Metastore Databases
  • Using Hue to Administer Your Cluster
  • Allowing Your Users to Use Hue
  • Installing Hue
  • Configuring Your Cluster to Work with Hue
  • Managing Hue
  • Working with Hue
  • Implementing Specialized HDFS Features
  • Deploying HDFS and YARN in a Multihomed Network
  • Short-Circuit Local Reads
  • Mountable HDFS
  • Using an NFS Gateway for Mounting HDFS to a Local File System
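
A sampler of the YARN and decommissioning commands covered here (the application ID is illustrative):

    yarn application -list
    yarn application -status application_1537600000000_0001
    yarn application -kill application_1537600000000_0001
    yarn node -list -all
    # to decommission a worker, add its host name to the excludes files referenced by
    # dfs.hosts.exclude (HDFS) and yarn.resourcemanager.nodes.exclude-path (YARN), then:
    hdfs dfsadmin -refreshNodes
    yarn rmadmin -refreshNodes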

17. Monitoring, Metrics and Hadoop Logging :
  • Monitoring Linux Servers
  • Basics of Linux System Monitoring
  • Monitoring Tools for Linux Systems
  • Hadoop Metrics
  • Hadoop Metric Types
  • Using the Hadoop Metrics
  • Capturing Metrics to a File System
  • Using Ganglia for Monitoring
  • Ganglia Architecture
  • Setting Up the Ganglia and Hadoop Integration
  • Setting Up the Hadoop Metrics
  • Understanding Hadoop Logging
  • Hadoop Log Messages
  • Daemon and Application Logs and How to View Them
  • How Application Logging Works
  • How Hadoop Uses HDFS Staging Directories and Local Directories During a Job Run
  • How the NodeManager Uses the Local Directories
  • Storing Job Logs in HDFS through Log Aggregation
  • Working with the Hadoop Daemon Logs
  • Using Hadoop’s Web UIs for Monitoring
  • Monitoring Jobs with the ResourceManager Web UI
  • The JobHistoryServer Web UI
  • Monitoring with the NameNode Web UI
  • Monitoring Other Hadoop Components
  • Monitoring Hive
  • Monitoring Spark
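
Two handy starting points, assuming log aggregation is enabled (yarn.log-aggregation-enable=true), with an illustrative host name and application ID:

    yarn logs -applicationId application_1537600000000_0001 > app.log
    curl "http://nn.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"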


18. Tuning the Cluster Resources, Optimizing MapReduce Jobs and Benchmarking :
  • How to Allocate YARN Memory and CPU
  • Allocating Memory
  • Configuring the Number of CPU Cores
  • Relationship between Memory and CPU Vcores
  • Configuring Efficient Performance
  • Speculative Execution
  • Reducing the I/O Load on the System
  • Tuning Map and Reduce Tasks—What the Administrator Can Do
  • Tuning the Map Tasks
  • Input and Output
  • Tuning the Reduce Tasks
  • Tuning the MapReduce Shuffle Process
  • Optimizing Pig and Hive Jobs
  • Optimizing Hive Jobs
  • Optimizing Pig Jobs
  • Benchmarking Your Cluster
  • Using TestDFSIO for Testing I/O Performance
  • Benchmarking with TeraSort
  • Using Hadoop’s Rumen and GridMix for Benchmarking
  • Hadoop Counters
  • File System Counters
  • Job Counters
  • MapReduce Framework Counters
  • Custom Java Counters
  • Limiting the Number of Counters
  • Optimizing MapReduce
  • Map-Only versus Map and Reduce Jobs
  • How Combiners Improve MapReduce Performance
  • Using a Partitioner to Improve Performance
  • Compressing Data During the MapReduce Process
  • Too Many Mappers or Reducers?
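
An illustrative sizing for a worker with 64 GB of RAM and 16 cores, plus the stock benchmarks (property values and wildcarded jar versions are assumptions to adapt):

    # yarn.nodemanager.resource.memory-mb  = 53248   (52 GB offered to containers)
    # yarn.nodemanager.resource.cpu-vcores = 14      (2 cores left for OS and daemons)
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
        TestDFSIO -write -nrFiles 10 -size 1GB
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        teragen 10000000 /user/test/teragen
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        terasort /user/test/teragen /user/test/terasort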

19. Configuring and Tuning Apache Spark on YARN :
  • Configuring Resource Allocation for Spark on YARN
  • Allocating CPU
  • Allocating Memory
  • How Resources are Allocated to Spark
  • Limits on the Resource Allocation to Spark Applications
  • Allocating Resources to the Driver
  • Configuring Resources for the Executors
  • How Spark Uses its Memory
  • Things to Remember
  • Cluster or Client Mode?
  • Configuring Spark-Related Network Parameters
  • Dynamic Resource Allocation when Running Spark on YARN
  • Dynamic and Static Resource Allocation
  • How Spark Manages Dynamic Resource Allocation
  • Enabling Dynamic Resource Allocation
  • Storage Formats and Compressing Data
  • Storage Formats
  • File Sizes
  • Compression
  • Monitoring Spark Applications
  • Using the Spark Web UI to Understand Performance
  • Spark System and the Metrics REST API
  • The Spark History Server on YARN
  • Tracking Jobs from the Command Line
  • Tuning Garbage Collection
  • The Mechanics of Garbage Collection
  • How to Collect GC Statistics
  • Tuning Spark Streaming Applications
  • Reducing Batch Processing Time
  • Setting the Right Batch Interval
  • Tuning Memory and Garbage Collection
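
Pulling these settings together, a hedged spark-submit sketch with dynamic allocation, which also requires the Spark shuffle service to be registered with the NodeManagers (class and jar are placeholders):

    $SPARK_HOME/bin/spark-submit \
        --master yarn --deploy-mode cluster \
        --driver-memory 2g --executor-memory 4g --executor-cores 2 \
        --conf spark.dynamicAllocation.enabled=true \
        --conf spark.shuffle.service.enabled=true \
        --conf spark.dynamicAllocation.minExecutors=2 \
        --conf spark.dynamicAllocation.maxExecutors=20 \
        --class com.example.MyApp /tmp/myapp.jar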

20. Optimizing Spark Applications :
  • Revisiting the Spark Execution Model
  • The Spark Execution Model
  • Shuffle Operations and How to Minimize Them
  • A WordCount Example to Our Rescue Again
  • Impact of a Shuffle Operation
  • Configuring the Shuffle Parameters
  • Partitioning and Parallelism (Number of Tasks)
  • Level of Parallelism
  • Problems with Too Few Tasks
  • Setting the Default Number of Partitions
  • How to Increase the Number of Partitions
  • Using the Repartition and Coalesce Operators to Change the Number of Partitions in an RDD
  • Two Types of Partitioners
  • Data Partitioning and How It Can Avoid a Shuffle
  • Optimizing Data Serialization and Compression
  • Data Serialization
  • Configuring Compression
  • Understanding Spark’s SQL Query Optimizer
  • Understanding the Optimizer Steps
  • Spark’s Speculative Execution Feature
  • The Importance of Data Locality
  • Caching Data
  • Fault-Tolerance Due to Caching
  • How to Specify Caching
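
Several of these optimizations are plain configuration; an illustrative example setting the Kryo serializer, shuffle parallelism and speculative execution at submit time:

    $SPARK_HOME/bin/spark-submit --master yarn \
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
        --conf spark.sql.shuffle.partitions=200 \
        --conf spark.speculation=true \
        --class com.example.MyApp /tmp/myapp.jar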

21. Troubleshooting Hadoop—A Sampler :
  • Space-Related Issues
  • Dealing with a 100 Percent Full Linux File System
  • HDFS Space Issues
  • Local and Log Directories Out of Free Space
  • Disk Volume Failure Toleration
  • Handling YARN Jobs That Are Stuck
  • JVM Memory-Allocation and Garbage-Collection Strategies
  • Understanding JVM Garbage Collection
  • Optimizing Garbage Collection
  • Analyzing Memory Usage
  • Out of Memory Errors
  • ApplicationMaster Memory Issues
  • Handling Different Types of Failures
  • Handling Daemon Failures
  • Starting Failures for Hadoop Daemons
  • Task and Job Failures
  • Troubleshooting Spark Jobs
  • Spark’s Fault Tolerance Mechanism
  • Killing Spark Jobs
  • Maximum Attempts for a Job
  • Maximum Failures per Job
  • Debugging Spark Applications
  • Viewing Logs with Log Aggregation
  • Viewing Logs When Log Aggregation Is Not Enabled
  • Reviewing the Launch Environment
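
A few first-response checks when a cluster misbehaves, with illustrative paths and IDs:

    df -h /data                                  # is the local file system full?
    hdfs dfsadmin -report                        # HDFS capacity and dead nodes
    yarn application -list -appStates RUNNING    # anything stuck?
    yarn application -kill application_1537600000000_0001
    # GC statistics for a daemon: add -verbose:gc -XX:+PrintGCDetails to its
    # JVM options (e.g., HADOOP_NAMENODE_OPTS) in hadoop-env.sh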

22. Installing VirtualBox and Linux and Cloning the Virtual Machines :
  • Installing Oracle VirtualBox
  • Installing Oracle Enterprise Linux
  • Cloning the Linux Server
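
Cloning can also be scripted with the VirtualBox CLI; a sketch with illustrative VM names:

    VBoxManage clonevm hadoop01 --name hadoop02 --register
    VBoxManage startvm hadoop02 --type headless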
