I. Introduction to Hadoop—Architecture and Hadoop Clusters:
1. Introduction to Hadoop and Its Environment
- Hadoop—An Introduction
- Unique Features of Hadoop
- Big Data and Hadoop
- A Typical Scenario for Using Hadoop
- Traditional Database Systems
- Data Lake
- Big Data, Data Science and Hadoop
- Cluster Computing and Hadoop Clusters
- Cluster Computing
- Hadoop Clusters
- Hadoop Components and the Hadoop Ecosphere
- What Do Hadoop Administrators Do?
- Hadoop Administration—A New Paradigm
- What You Need to Know to Administer Hadoop
- The Hadoop Administrator’s Toolset
- Key Differences between Hadoop 1 and Hadoop 2
- Architectural Differences
- High-Availability Features
- Multiple Processing Engines
- Separation of Processing and Scheduling
- Resource Allocation in Hadoop 1 and Hadoop 2
- Distributed Data Processing: MapReduce and Spark, Hive and Pig
- MapReduce
- Apache Spark
- Apache Hive
- Apache Pig
- Data Integration: Apache Sqoop, Apache Flume and Apache Kafka
- Key Areas of Hadoop Administration
- Managing the Cluster Storage
- Allocating the Cluster Resources
- Scheduling Jobs
- Securing Hadoop Data
2. An Introduction to the Architecture of Hadoop:
- Distributed Computing and Hadoop
- Hadoop Architecture
- A Hadoop Cluster
- Master and Worker Nodes
- Hadoop Services
- Data Storage—The Hadoop Distributed File System
- HDFS Unique Features
- HDFS Architecture
- The HDFS File System
- NameNode Operations
- Data Processing with YARN, the Hadoop Operating System
- Architecture of YARN
- How the ApplicationMaster Works with the ResourceManager to Allocate Resources
3. Creating and Configuring a Simple Hadoop Cluster:
- Hadoop Distributions and Installation Types
- Hadoop Distributions
- Hadoop Installation Types
- Setting Up a Pseudo-Distributed Hadoop Cluster
- Meeting the Operating System Requirements
- Modifying Kernel Parameters
- Setting Up SSH
- Java Requirements
- Installing the Hadoop Software
- Creating the Necessary Hadoop Users
- Creating the Necessary Directories
- Performing the Initial Hadoop Configuration
- Environment Configuration Files
- Read-Only Default Configuration Files
- Site-Specific Configuration Files
- Other Hadoop-Related Configuration Files
- Precedence among the Configuration Files
- Variable Expansion and Configuration Parameters
- Configuring the Hadoop Daemons Environment
- Configuring Core Hadoop Properties (with the core-site.xml File)
- Configuring MapReduce (with the mapred-site.xml File)
- Configuring YARN (with the yarn-site.xml File)
- Operating the New Hadoop Cluster
- Formatting the Distributed File System
- Setting the Environment Variables
- Starting the HDFS and YARN Services
- Verifying the Service Startup
- Shutting Down the Services
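To give a flavor of what Chapter 3 covers, here is a minimal sketch of bringing a freshly configured cluster up and down with the standard Apache Hadoop 2.x scripts; it assumes HADOOP_HOME is set and the configuration files listed above are already in place:

```bash
# Format the distributed file system (run once, on the NameNode, as the hdfs user)
hdfs namenode -format

# Start the HDFS and YARN services
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

# Verify the service startup
jps                      # should list NameNode, DataNode, ResourceManager and NodeManager
hdfs dfsadmin -report    # confirms the DataNodes have registered with the NameNode

# Shut the services down in the reverse order
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh
```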
4. Planning for and Creating a Fully Distributed Cluster:
- Planning Your Hadoop Cluster
- General Cluster Planning Considerations
- Server Form Factors
- Criteria for Choosing the Nodes
- Going from a Single Rack to Multiple Racks
- Sizing a Hadoop Cluster
- General Principles Governing the Choice of CPU, Memory and Storage
- Special Treatment for the Master Nodes
- Recommendations for Sizing the Servers
- Growing a Cluster
- Guidelines for Large Clusters
- Creating a Multinode Cluster
- How the Test Cluster Is Set Up
- Modifying the Hadoop Configuration
- Changing the HDFS Configuration (hdfs-site.xml file)
- Changing the YARN Configuration
- Changing the MapReduce Configuration
- Starting Up the Cluster
- Starting Up and Shutting Down the Cluster with Scripts
- Performing a Quick Check of the New Cluster’s File System
- Configuring Hadoop Services, Web Interfaces and Ports
- Service Configuration and Web Interfaces
- Setting Port Numbers for Hadoop Services
- Hadoop Clients
II. Hadoop Application Frameworks:
5. Running Applications in a Cluster—The MapReduce Framework (and Hive and Pig):
- The MapReduce Framework
- The MapReduce Model
- How MapReduce Works
- MapReduce Job Processing
- A Simple MapReduce Program
- Understanding Hadoop’s Job Processing—Running a WordCount Program
- MapReduce Input and Output Directories
- How Hadoop Shows You the Job Details
- Hadoop Streaming
- Apache Hive
- Hive Data Organization
- Working with Hive Tables
- Loading Data into Hive
- Querying with Hive
- Apache Pig
- Pig Execution Modes
- A Simple Pig Example
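As a taste of Chapter 5, a sketch of running the stock WordCount example and the same job through Hadoop Streaming; the jar paths, the input/output directories and the mapper.py/reducer.py scripts are placeholders that vary by installation:

```bash
# Run the WordCount example that ships with Hadoop
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/hadoop/input /user/hadoop/output

# The same idea through Hadoop Streaming, with user-supplied scripts
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hadoop/input \
    -output /user/hadoop/output_streaming
```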
6. Running Applications in a Cluster—The Spark Framework:
- What Is Spark?
- Why Spark?
- Speed
- Ease of Use and Accessibility
- General-Purpose Framework
- Spark and Hadoop
- The Spark Stack
- Installing Spark
- Spark Examples
- Key Spark Files and Directories
- Compiling the Spark Binaries
- Reducing Spark’s Verbosity
- Spark Run Modes
- Local Mode
- Cluster Mode
- Understanding the Cluster Managers
- The Standalone Cluster Manager
- Spark on Apache Mesos
- Spark on YARN
- How YARN and Spark Work Together
- Setting Up Spark on a Hadoop Cluster
- Spark and Data Access
- Loading Data from the Linux File System
- Loading Data from HDFS
- Loading Data from a Relational Database
7. Running Spark Applications:
- The Spark Programming Model
- Spark Programming and RDDs
- Programming Spark
- Spark Applications
- Basics of RDDs
- Creating an RDD
- RDD Operations
- RDD Persistence
- Architecture of a Spark Application
- Spark Terminology
- Components of a Spark Application
- Running Spark Applications Interactively
- Spark Shell and Spark Applications
- A Bit about the Spark Shell
- Using the Spark Shell
- Overview of Spark Cluster Execution
- Creating and Submitting Spark Applications
- Building the Spark Application
- Running an Application in the Standalone Spark Cluster
- Using spark-submit to Execute Applications
- Running Spark Applications on Mesos
- Running Spark Applications in a YARN-Managed Hadoop Cluster
- Using the JDBC/ODBC Server
- Configuring Spark Applications
- Spark Configuration Properties
- Specifying Configuration when Running spark-submit
- Monitoring Spark Applications
- Handling Streaming Data with Spark Streaming
- How Spark Streaming Works
- A Spark Streaming Example—WordCount Again!
- Using Spark SQL for Handling Structured Data
- DataFrames
- HiveContext and SQLContext
- Working with Spark SQL
- Creating DataFrames
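A minimal sketch of submitting a Spark application to a YARN-managed cluster, as Chapter 7 describes; SparkPi is the stock example class, and the jar location and resource numbers are illustrative:

```bash
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    --num-executors 4 \
    --executor-memory 2g \
    --executor-cores 2 \
    $SPARK_HOME/examples/jars/spark-examples_*.jar 100
```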
III. Managing and Protecting Hadoop Data and High Availability:
8. The Role of the NameNode and How HDFS Works:
- HDFS—The Interaction between the NameNode and the DataNodes
- Interaction between the Clients and HDFS
- NameNode and DataNode Communications
- Rack Awareness and Topology
- How to Configure Rack Awareness in Your Cluster
- Finding Your Cluster’s Rack Information
- HDFS Data Replication
- HDFS Data Organization and Data Blocks
- Data Replication
- Block and Replica States
- How Clients Read and Write HDFS Data
- How Clients Read HDFS Data
- How Clients Write Data to HDFS
- Understanding HDFS Recovery Processes
- Generation Stamp
- Lease Recovery
- Block Recovery
- Pipeline Recovery
- Centralized Cache Management in HDFS
- Hadoop and OS Page Caching
- The Key Principles Behind Centralized Cache Management
- How Centralized Cache Management Works
- Configuring Caching
- Cache Directives
- Cache Pools
- Using the Cache
- Hadoop Archival Storage, SSD and Memory (Heterogeneous Storage)
- Performance Characteristics of Storage Types
- Changes in the Storage Architecture
- Storage Preferences for Files
- Setting Up Archival Storage
- Managing Storage Policies
- Moving Data Around
- Implementing Archival Storage
9. HDFS Commands, HDFS Permissions and HDFS Storage:
- Managing HDFS through the HDFS Shell Commands
- Using the hdfs dfs Utility to Manage HDFS
- Listing HDFS Files and Directories
- Creating an HDFS Directory
- Removing HDFS Files and Directories
- Changing File and Directory Ownership and Groups
- Using the dfsadmin Utility to Perform HDFS Operations
- The dfsadmin -report Command
- Managing HDFS Permissions and Users
- HDFS File Permissions
- HDFS Users and Super Users
- Managing HDFS Storage
- Checking HDFS Disk Usage
- Allocating HDFS Space Quotas
- Rebalancing HDFS Data
- Reasons for HDFS Data Imbalance
- Running the Balancer Tool to Balance HDFS Data
- Using hdfs dfsadmin to Make Things Easier
- When to Run the Balancer
- Reclaiming HDFS Space
- Removing Files and Directories
- Decreasing the Replication Factor
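A short sketch of the everyday HDFS shell and dfsadmin operations Chapter 9 covers; the user, group, paths and quota values are placeholders:

```bash
# File system operations with hdfs dfs
hdfs dfs -ls /user
hdfs dfs -mkdir -p /user/alice/data
hdfs dfs -chown alice:analysts /user/alice/data
hdfs dfs -rm -r /user/alice/old_data

# Administration and storage management
hdfs dfsadmin -report                          # cluster-wide storage report
hdfs dfs -du -h /user                          # per-directory disk usage
hdfs dfsadmin -setSpaceQuota 1t /user/alice    # allocate a space quota
hdfs balancer -threshold 10                    # rebalance data across the DataNodes
hdfs dfs -setrep -w 2 /user/alice/data         # lower the replication factor to reclaim space
```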
10. Data Protection, File Formats and Accessing HDFS:
- Safeguarding Data
- Using HDFS Trash to Prevent Accidental Data Deletion
- Using HDFS Snapshots to Protect Important Data
- Ensuring Data Integrity with File System Checks
- Data Compression
- Common Compression Formats
- Evaluating the Various Compression Schemes
- Compression at Various Stages for MapReduce
- Compression for Spark
- Data Serialization
- Hadoop File Formats
- Criteria for Determining the Right File Format
- File Formats Supported by Hadoop
- The Ideal File Format
- The Hadoop Small Files Problem and Merging Files
- Using a Federated NameNode to Overcome the Small Files Problem
- Using Hadoop Archives to Manage Many Small Files
- Handling the Performance Impact of Small Files
- Using Hadoop WebHDFS and HttpFS
- WebHDFS—The Hadoop REST API
- Using the WebHDFS API
- Understanding the WebHDFS Commands
- Using HttpFS Gateway to Access HDFS from Behind a Firewall
- Summary
11. NameNode Operations, High Availability and Federation:
- Understanding NameNode Operations
- HDFS Metadata
- The NameNode Startup Process
- How the NameNode and the DataNodes Work Together
- The Checkpointing Process
- Secondary, Checkpoint, Backup and Standby Nodes
- Configuring the Checkpointing Frequency
- Managing Checkpoint Performance
- The Mechanics of Checkpointing
- NameNode Safe Mode Operations
- Automatic Safe Mode Operations
- Placing the NameNode in Safe Mode
- How the NameNode Transitions Through Safe Mode
- Backing Up and Recovering the NameNode Metadata
- Configuring HDFS High Availability
- NameNode HA Architecture (QJM)
- Setting Up an HDFS HA Quorum Cluster
- Deploying the High-Availability NameNodes
- Managing an HA NameNode Setup
- HA Manual and Automatic Failover
- HDFS Federation
- Architecture of a Federated NameNode
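A brief sketch of the NameNode housekeeping commands behind Chapter 11; nn1 and nn2 are the NameNode IDs configured for HA, and the backup path is a placeholder:

```bash
# Safe mode and a manual metadata checkpoint (run as the HDFS superuser)
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave

# Pull a copy of the latest fsimage for backup
hdfs dfsadmin -fetchImage /backup/namenode

# Inspect and drive an HA NameNode pair
hdfs haadmin -getServiceState nn1
hdfs haadmin -failover nn1 nn2
```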
IV. Moving Data, Allocating Resources, Scheduling Jobs and Security:
12. Moving Data Into and Out of Hadoop:
- Introduction to Hadoop Data Transfer Tools
- Loading Data into HDFS from the Command Line
- Using the -cat Command to Dump a File’s Contents
- Testing HDFS Files
- Copying and Moving Files from and to HDFS
- Using the -get Command to Move Files
- Moving Files from and to HDFS
- Using the -tail and -head Commands
- Copying HDFS Data between Clusters with DistCp
- How to Use the DistCp Command to Move Data
- DistCp Options
- Ingesting Data from Relational Databases with Sqoop
- Sqoop Architecture
- Deploying Sqoop
- Using Sqoop to Move Data
- Importing Data with Sqoop
- Importing Data into Hive
- Exporting Data with Sqoop
- Ingesting Data from External Sources with Flume
- Flume Architecture in a Nutshell
- Configuring the Flume Agent
- A Simple Flume Example
- Using Flume to Move Data to HDFS
- A More Complex Flume Example
- Ingesting Data with Kafka
- Benefits Offered by Kafka
- How Kafka Works
- Setting Up an Apache Kafka Cluster
- Integrating Kafka with Hadoop and Storm
- Summary
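Two of the data-movement tools from Chapter 12 in sketch form; the cluster names, JDBC URL, credentials and table are placeholders:

```bash
# Copy data between clusters with DistCp
hadoop distcp hdfs://nn-prod:8020/data/events hdfs://nn-dr:8020/data/events

# Import a relational table into HDFS with Sqoop
sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username sqoop_user \
    --password-file /user/hadoop/.db_password \
    --table customers \
    --target-dir /data/sales/customers \
    --num-mappers 4
```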
13. Resource Allocation in a Hadoop Cluster:
- Resource Allocation in Hadoop
- Managing Cluster Workloads
- Hadoop’s Resource Schedulers
- The FIFO Scheduler
- The Capacity Scheduler
- Queues and Subqueues
- How the Cluster Allocates Resources
- Preempting Applications
- Enabling the Capacity Scheduler
- A Typical Capacity Scheduler
- The Fair Scheduler
- Queues
- Configuring the Fair Scheduler
- How Jobs Are Placed into Queues
- Application Preemption in the Fair Scheduler
- Security and Resource Pools
- A Sample fair-scheduler.xml File
- Submitting Jobs to the Scheduler
- Moving Applications between Queues
- Monitoring the Fair Scheduler
- Comparing the Capacity Scheduler and the Fair Scheduler
- Similarities between the Two Schedulers
- Differences between the Two Schedulers
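A minimal sketch of the kind of two-queue Capacity Scheduler setup Chapter 13 walks through; the queue names and percentages are illustrative, and it assumes the Capacity Scheduler is already the configured scheduler in yarn-site.xml:

```bash
cat > $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml <<'EOF'
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
</configuration>
EOF

# Have the ResourceManager pick up the queue changes without a restart
yarn rmadmin -refreshQueues
```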
14. Working with Oozie to Manage Job Workflows:
- Using Apache Oozie to Schedule Jobs
- Oozie Architecture
- The Oozie Server
- The Oozie Client
- The Oozie Database
- Deploying Oozie in Your Cluster
- Installing and Configuring Oozie
- Configuring Hadoop for Oozie
- Understanding Oozie Workflows
- Workflows, Control Flow, and Nodes
- Defining the Workflows with the workflow.xml File
- How Oozie Runs an Action
- Configuring the Action Nodes
- Creating an Oozie Workflow
- Configuring the Control Nodes
- Configuring the Job
- Running an Oozie Workflow Job
- Specifying the Job Properties
- Deploying Oozie Jobs
- Creating Dynamic Workflows
- Oozie Coordinators
- Time-Based Coordinators
- Data-Based Coordinators
- Time-and-Data-Based Coordinators
- Submitting the Oozie Coordinator from the Command Line
- Managing and Administering Oozie
- Common Oozie Commands and How to Run Them
- Troubleshooting Oozie
- Oozie cron Scheduling and Oozie Service Level Agreements
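A sketch of the Oozie client commands Chapter 14 relies on; the Oozie URL, the job.properties file and the workflow job ID are placeholders:

```bash
export OOZIE_URL=http://oozieserver:11000/oozie

oozie job -run -config job.properties                    # submit and start the workflow
oozie job -info 0000012-170101123456789-oozie-oozi-W     # check its status
oozie job -log  0000012-170101123456789-oozie-oozi-W     # fetch its log
oozie jobs -jobtype coordinator                          # list coordinator jobs
```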
15. Securing Hadoop:
- Hadoop Security—An Overview
- Authentication, Authorization and Accounting
- Hadoop Authentication with Kerberos
- Kerberos and How It Works
- The Kerberos Authentication Process
- Kerberos Trusts
- A Special Principal
- Adding Kerberos Authorization to your Cluster
- Setting Up Kerberos for Hadoop
- Securing a Hadoop Cluster with Kerberos
- How Kerberos Authenticates Users and Services
- Managing a Kerberized Hadoop Cluster
- Hadoop Authorization
- HDFS Permissions
- Service Level Authorization
- Role-Based Authorization with Apache Sentry
- Auditing Hadoop
- Auditing HDFS Operations
- Auditing YARN Operations
- Securing Hadoop Data
- HDFS Transparent Encryption
- Encrypting Data in Transition
- Other Hadoop-Related Security Initiatives
- Securing a Hadoop Infrastructure with Apache Knox Gateway
- Apache Ranger for Security Administration
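A sketch of the Kerberos legwork behind Chapter 15; the realm, host name, keytab path and user principal are placeholders:

```bash
# Create a service principal and keytab for a DataNode host
kadmin.local -q "addprinc -randkey hdfs/dn1.example.com@EXAMPLE.COM"
kadmin.local -q "xst -k /etc/security/keytabs/hdfs.keytab hdfs/dn1.example.com@EXAMPLE.COM"

# A user obtains a ticket before working against the kerberized cluster
kinit alice@EXAMPLE.COM
klist                       # confirm the ticket was granted
hdfs dfs -ls /user/alice    # now succeeds against the secured NameNode
```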
V. Monitoring, Optimization and Troubleshooting:
16. Managing Jobs, Using Hue and Performing Routine Tasks:
- Using the YARN Commands to Manage Hadoop Jobs
- Viewing YARN Applications
- Checking the Status of an Application
- Killing a Running Application
- Checking the Status of the Nodes
- Checking YARN Queues
- Getting the Application Logs
- YARN Administrative Commands
- Decommissioning and Recommissioning Nodes
- Including and Excluding Hosts
- Decommissioning DataNodes and NodeManagers
- Recommissioning Nodes
- Things to Remember about Decommissioning and Recommissioning
- Adding a New DataNode and/or a NodeManager
- ResourceManager High Availability
- ResourceManager High-Availability Architecture
- Setting Up ResourceManager High Availability
- ResourceManager Failover
- Using the ResourceManager High-Availability Commands
- Performing Common Management Tasks
- Moving the NameNode to a Different Host
- Managing High-Availability NameNodes
- Using a Shutdown/Startup Script to Manage your Cluster
- Balancing HDFS
- Balancing the Storage on the DataNodes
- Managing the MySQL Database
- Configuring a MySQL Database
- Configuring MySQL High Availability
- Backing Up Important Cluster Data
- Backing Up HDFS Metadata
- Backing Up the Metastore Databases
- Using Hue to Administer Your Cluster
- Allowing Your Users to Use Hue
- Installing Hue
- Configuring Your Cluster to Work with Hue
- Managing Hue
- Working with Hue
- Implementing Specialized HDFS Features
- Deploying HDFS and YARN in a Multihomed Network
- Short-Circuit Local Reads
- Mountable HDFS
- Using an NFS Gateway for Mounting HDFS to a Local File System
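A quick sketch of the YARN application and node-management commands Chapter 16 covers; the application ID is a placeholder:

```bash
# Inspect and manage running applications
yarn application -list
yarn application -status application_1488866535453_0001
yarn application -kill   application_1488866535453_0001
yarn logs -applicationId application_1488866535453_0001   # aggregated application logs

# Node management: edit the include/exclude files, then refresh
yarn node -list -all
yarn rmadmin -refreshNodes     # decommission or recommission NodeManagers
hdfs dfsadmin -refreshNodes    # the HDFS counterpart for DataNodes
```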
17. Monitoring, Metrics and Hadoop Logging:
- Monitoring Linux Servers
- Basics of Linux System Monitoring
- Monitoring Tools for Linux Systems
- Hadoop Metrics
- Hadoop Metric Types
- Using the Hadoop Metrics
- Capturing Metrics to a File System
- Using Ganglia for Monitoring
- Ganglia Architecture
- Setting Up the Ganglia and Hadoop Integration
- Setting Up the Hadoop Metrics
- Understanding Hadoop Logging
- Hadoop Log Messages
- Daemon and Application Logs and How to View Them
- How Application Logging Works
- How Hadoop Uses HDFS Staging Directories and Local Directories During a Job Run
- How the NodeManager Uses the Local Directories
- Storing Job Logs in HDFS through Log Aggregation
- Working with the Hadoop Daemon Logs
- Using Hadoop’s Web UIs for Monitoring
- Monitoring Jobs with the ResourceManager Web UI
- The JobHistoryServer Web UI
- Monitoring with the NameNode Web UI
- Monitoring Other Hadoop Components
- Monitoring Hive
- Monitoring Spark
18. Tuning the Cluster Resources, Optimizing MapReduce Jobs and Benchmarking:
- How to Allocate YARN Memory and CPU
- Allocating Memory
- Configuring the Number of CPU Cores
- Relationship between Memory and CPU Vcores
- Configuring Efficient Performance
- Speculative Execution
- Reducing the I/O Load on the System
- Tuning Map and Reduce Tasks—What the Administrator Can Do
- Tuning the Map Tasks
- Input and Output
- Tuning the Reduce Tasks
- Tuning the MapReduce Shuffle Process
- Optimizing Pig and Hive Jobs
- Optimizing Hive Jobs
- Optimizing Pig Jobs
- Benchmarking Your Cluster
- Using TestDFSIO for Testing I/O Performance
- Benchmarking with TeraSort
- Using Hadoop’s Rumen and GridMix for Benchmarking
- Hadoop Counters
- File System Counters
- Job Counters
- MapReduce Framework Counters
- Custom Java Counters
- Limiting the Number of Counters
- Optimizing MapReduce
- Map-Only versus Map and Reduce Jobs
- How Combiners Improve MapReduce Performance
- Using a Partitioner to Improve Performance
- Compressing Data During the MapReduce Process
- Too Many Mappers or Reducers?
19. Configuring and Tuning Apache Spark on YARN:
- Configuring Resource Allocation for Spark on YARN
- Allocating CPU
- Allocating Memory
- How Resources are Allocated to Spark
- Limits on the Resource Allocation to Spark Applications
- Allocating Resources to the Driver
- Configuring Resources for the Executors
- How Spark Uses its Memory
- Things to Remember
- Cluster or Client Mode?
- Configuring Spark-Related Network Parameters
- Dynamic Resource Allocation when Running Spark on YARN
- Dynamic and Static Resource Allocation
- How Spark Manages Dynamic Resource Allocation
- Enabling Dynamic Resource Allocation
- Storage Formats and Compressing Data
- Storage Formats
- File Sizes
- Compression
- Monitoring Spark Applications
- Using the Spark Web UI to Understand Performance
- Spark System and the Metrics REST API
- The Spark History Server on YARN
- Tracking Jobs from the Command Line
- Tuning Garbage Collection
- The Mechanics of Garbage Collection
- How to Collect GC Statistics
- Tuning Spark Streaming Applications
- Reducing Batch Processing Time
- Setting the Right Batch Interval
- Tuning Memory and Garbage Collection
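A sketch of the resource-allocation and dynamic-allocation settings Chapter 19 discusses, expressed as a spark-submit invocation; the values, application class and jar are illustrative, and the external shuffle service must also be configured on every NodeManager:

```bash
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --executor-cores 2 \
    --executor-memory 4g \
    --conf spark.yarn.executor.memoryOverhead=512 \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=2 \
    --conf spark.dynamicAllocation.maxExecutors=20 \
    --conf spark.shuffle.service.enabled=true \
    --class com.example.MyApp myapp.jar
```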
20. Optimizing Spark Applications:
- Revisiting the Spark Execution Model
- The Spark Execution Model
- Shuffle Operations and How to Minimize Them
- A WordCount Example to Our Rescue Again
- Impact of a Shuffle Operation
- Configuring the Shuffle Parameters
- Partitioning and Parallelism (Number of Tasks)
- Level of Parallelism
- Problems with Too Few Tasks
- Setting the Default Number of Partitions
- How to Increase the Number of Partitions
- Using the Repartition and Coalesce Operators to Change the Number of Partitions in an RDD
- Two Types of Partitioners
- Data Partitioning and How It Can Avoid a Shuffle
- Optimizing Data Serialization and Compression
- Data Serialization
- Configuring Compression
- Understanding Spark’s SQL Query Optimizer
- Understanding the Optimizer Steps
- Spark’s Speculative Execution Feature
- The Importance of Data Locality
- Caching Data
- Fault-Tolerance Due to Caching
- How to Specify Caching
21. Troubleshooting Hadoop—A Sampler:
- Space-Related Issues
- Dealing with a 100 Percent Full Linux File System
- HDFS Space Issues
- Local and Log Directories Out of Free Space
- Disk Volume Failure Toleration
- Handling YARN Jobs That Are Stuck
- JVM Memory-Allocation and Garbage-Collection Strategies
- Understanding JVM Garbage Collection
- Optimizing Garbage Collection
- Analyzing Memory Usage
- Out of Memory Errors
- ApplicationMaster Memory Issues
- Handling Different Types of Failures
- Handling Daemon Failures
- Starting Failures for Hadoop Daemons
- Task and Job Failures
- Troubleshooting Spark Jobs
- Spark’s Fault Tolerance Mechanism
- Killing Spark Jobs
- Maximum Attempts for a Job
- Maximum Failures per Job
- Debugging Spark Applications
- Viewing Logs with Log Aggregation
- Viewing Logs When Log Aggregation Is Not Enabled
- Reviewing the Launch Environment
22. Installing VirtualBox and Linux and Cloning the Virtual Machines:
- Installing Oracle VirtualBox
- Installing Oracle Enterprise Linux
- Cloning the Linux Server