I. Introduction to Hadoop—Architecture and Hadoop Clusters:
1. Introduction to Hadoop and Its Environment
- Hadoop—An Introduction
- Unique Features of Hadoop
- Big Data and Hadoop
- A Typical Scenario for Using Hadoop
- Traditional Database Systems
- Data Lake
- Big Data, Data Science and Hadoop
- Cluster Computing and Hadoop Clusters
- Cluster Computing
- Hadoop Clusters
- Hadoop Components and the Hadoop Ecosphere
- What Do Hadoop Administrators Do?
- Hadoop Administration—A New Paradigm
- What You Need to Know to Administer Hadoop
- The Hadoop Administrator’s Toolset
- Key Differences between Hadoop 1 and Hadoop 2
- Architectural Differences
- High-Availability Features
- Multiple Processing Engines
- Separation of Processing and Scheduling
- Resource Allocation in Hadoop 1 and Hadoop 2
- Distributed Data Processing: MapReduce and Spark, Hive and Pig
- MapReduce
- Apache Spark
- Apache Hive
- Apache Pig
- Data Integration: Apache Sqoop, Apache Flume and Apache Kafka
- Key Areas of Hadoop Administration
- Managing the Cluster Storage
- Allocating the Cluster Resources
- Scheduling Jobs
- Securing Hadoop Data
2. An Introduction to the Architecture of Hadoop:
- Distributed Computing and Hadoop
- Hadoop Architecture
- A Hadoop Cluster
- Master and Worker Nodes
- Hadoop Services
- Data Storage—The Hadoop Distributed File System
- HDFS Unique Features
- HDFS Architecture
- The HDFS File System
- NameNode Operations
- Data Processing with YARN, the Hadoop Operating System
- Architecture of YARN
- How the ApplicationMaster Works with the ResourceManager to Allocate Resources
3. Creating and Configuring a Simple Hadoop Cluster:
- Hadoop Distributions and Installation Types
- Hadoop Distributions
- Hadoop Installation Types
- Setting Up a Pseudo-Distributed Hadoop Cluster
- Meeting the Operating System Requirements
- Modifying Kernel Parameters
- Setting Up SSH
- Java Requirements
- Installing the Hadoop Software
- Creating the Necessary Hadoop Users
- Creating the Necessary Directories
- Performing the Initial Hadoop Configuration
- Environment Configuration Files
- Read-Only Default Configuration Files
- Site-Specific Configuration Files
- Other Hadoop-Related Configuration Files
- Precedence among the Configuration Files
- Variable Expansion and Configuration Parameters
- Configuring the Hadoop Daemons Environment
- Configuring Core Hadoop Properties (with the core-site.xml File)
- Configuring MapReduce (with the mapred-site.xml File)
- Configuring YARN (with the yarn-site.xml File)
- Operating the New Hadoop Cluster
- Formatting the Distributed File System
- Setting the Environment Variables
- Starting the HDFS and YARN Services
- Verifying the Service Startup
- Shutting Down the Services
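To give a flavor of what Chapter 3 covers, here is a minimal sketch of bringing a freshly configured cluster up and down with the standard Apache Hadoop 2.x scripts; it assumes HADOOP_HOME is set and the configuration files listed above are already in place:

```bash
# Format the distributed file system (run once, on the NameNode, as the hdfs user)
hdfs namenode -format

# Start the HDFS and YARN services
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

# Verify the service startup
jps                      # should list NameNode, DataNode, ResourceManager and NodeManager
hdfs dfsadmin -report    # confirms the DataNodes have registered with the NameNode

# Shut the services down in the reverse order
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh
```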
4. Planning for and Creating a Fully Distributed Cluster:
- Planning Your Hadoop Cluster
- General Cluster Planning Considerations
- Server Form Factors
- Criteria for Choosing the Nodes
- Going from a Single Rack to Multiple Racks
- Sizing a Hadoop Cluster
- General Principles Governing the Choice of CPU, Memory and Storage
- Special Treatment for the Master Nodes
- Recommendations for Sizing the Servers
- Growing a Cluster
- Guidelines for Large Clusters
- Creating a Multinode Cluster
- How the Test Cluster Is Set Up
- Modifying the Hadoop Configuration
- Changing the HDFS Configuration (hdfs-site.xml file)
- Changing the YARN Configuration
- Changing the MapReduce Configuration
- Starting Up the Cluster
- Starting Up and Shutting Down the Cluster with Scripts
- Performing a Quick Check of the New Cluster’s File System
- Configuring Hadoop Services, Web Interfaces and Ports
- Service Configuration and Web Interfaces
- Setting Port Numbers for Hadoop Services
- Hadoop Clients
II. Hadoop Application Frameworks:
5. Running Applications in a Cluster—The MapReduce Framework (and Hive and Pig):
- The MapReduce Framework
- The MapReduce Model
- How MapReduce Works
- MapReduce Job Processing
- A Simple MapReduce Program
- Understanding Hadoop’s Job Processing—Running a WordCount Program
- MapReduce Input and Output Directories
- How Hadoop Shows You the Job Details
- Hadoop Streaming
- Apache Hive
- Hive Data Organization
- Working with Hive Tables
- Loading Data into Hive
- Querying with Hive
- Apache Pig
- Pig Execution Modes
- A Simple Pig Example
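As a taste of Chapter 5, a sketch of running the stock WordCount example and the same job through Hadoop Streaming; the jar paths, the input/output directories and the mapper.py/reducer.py scripts are placeholders that vary by installation:

```bash
# Run the WordCount example that ships with Hadoop
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/hadoop/input /user/hadoop/output

# The same idea through Hadoop Streaming, with user-supplied scripts
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hadoop/input \
    -output /user/hadoop/output_streaming
```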
6. Running Applications in a Cluster—The Spark Framework:
- What Is Spark?
- Why Spark?
- Speed
- Ease of Use and Accessibility
- General-Purpose Framework
- Spark and Hadoop
- The Spark Stack
- Installing Spark
- Spark Examples
- Key Spark Files and Directories
- Compiling the Spark Binaries
- Reducing Spark’s Verbosity
- Spark Run Modes
- Local Mode
- Cluster Mode
- Understanding the Cluster Managers
- The Standalone Cluster Manager
- Spark on Apache Mesos
- Spark on YARN
- How YARN and Spark Work Together
- Setting Up Spark on a Hadoop Cluster
- Spark and Data Access
- Loading Data from the Linux File System
- Loading Data from HDFS
- Loading Data from a Relational Database
7. Running Spark Applications:
- The Spark Programming Model
- Spark Programming and RDDs
- Programming Spark
- Spark Applications
- Basics of RDDs
- Creating an RDD
- RDD Operations
- RDD Persistence
- Architecture of a Spark Application
- Spark Terminology
- Components of a Spark Application
- Running Spark Applications Interactively
- Spark Shell and Spark Applications
- A Bit about the Spark Shell
- Using the Spark Shell
- Overview of Spark Cluster Execution
- Creating and Submitting Spark Applications
- Building the Spark Application
- Running an Application in the Standalone Spark Cluster
- Using spark-submit to Execute Applications
- Running Spark Applications on Mesos
- Running Spark Applications in a YARN-Managed Hadoop Cluster
- Using the JDBC/ODBC Server
- Configuring Spark Applications
- Spark Configuration Properties
- Specifying Configuration when Running spark-submit
- Monitoring Spark Applications
- Handling Streaming Data with Spark Streaming
- How Spark Streaming Works
- A Spark Streaming Example—WordCount Again!
- Using Spark SQL for Handling Structured Data
- DataFrames
- HiveContext and SQLContext
- Working with Spark SQL
- Creating DataFrames
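A minimal sketch of submitting a Spark application to a YARN-managed cluster, as Chapter 7 describes; SparkPi is the stock example class, and the jar location and resource numbers are illustrative:

```bash
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    --num-executors 4 \
    --executor-memory 2g \
    --executor-cores 2 \
    $SPARK_HOME/examples/jars/spark-examples_*.jar 100
```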
III. Managing and Protecting Hadoop Data and High Availability:
8. The Role of the NameNode and How HDFS Works:
- HDFS—The Interaction between the NameNode and the DataNodes
- Interaction between the Clients and HDFS
- NameNode and DataNode Communications
- Rack Awareness and Topology
- How to Configure Rack Awareness in Your Cluster
- Finding Your Cluster’s Rack Information
- HDFS Data Replication
- HDFS Data Organization and Data Blocks
- Data Replication
- Block and Replica States
- How Clients Read and Write HDFS Data
- How Clients Read HDFS Data
- How Clients Write Data to HDFS
- Understanding HDFS Recovery Processes
- Generation Stamp
- Lease Recovery
- Block Recovery
- Pipeline Recovery
- Centralized Cache Management in HDFS
- Hadoop and OS Page Caching
- The Key Principles Behind Centralized Cache Management
- How Centralized Cache Management Works
- Configuring Caching
- Cache Directives
- Cache Pools
- Using the Cache
- Hadoop Archival Storage, SSD and Memory (Heterogeneous Storage)
- Performance Characteristics of Storage Types
- Changes in the Storage Architecture
- Storage Preferences for Files
- Setting Up Archival Storage
- Managing Storage Policies
- Moving Data Around
- Implementing Archival Storage
9. HDFS Commands, HDFS Permissions and HDFS Storage:
- Managing HDFS through the HDFS Shell Commands
- Using the hdfs dfs Utility to Manage HDFS
- Listing HDFS Files and Directories
- Creating an HDFS Directory
- Removing HDFS Files and Directories
- Changing File and Directory Ownership and Groups
- Using the dfsadmin Utility to Perform HDFS Operations
- The dfsadmin -report Command
- Managing HDFS Permissions and Users
- HDFS File Permissions
- HDFS Users and Super Users
- Managing HDFS Storage
- Checking HDFS Disk Usage
- Allocating HDFS Space Quotas
- Rebalancing HDFS Data
- Reasons for HDFS Data Imbalance
- Running the Balancer Tool to Balance HDFS Data
- Using hdfs dfsadmin to Make Things Easier
- When to Run the Balancer
- Reclaiming HDFS Space
- Removing Files and Directories
- Decreasing the Replication Factor
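A short sketch of the everyday HDFS shell and dfsadmin operations Chapter 9 covers; the user, group, paths and quota values are placeholders:

```bash
# File system operations with hdfs dfs
hdfs dfs -ls /user
hdfs dfs -mkdir -p /user/alice/data
hdfs dfs -chown alice:analysts /user/alice/data
hdfs dfs -rm -r /user/alice/old_data

# Administration and storage management
hdfs dfsadmin -report                          # cluster-wide storage report
hdfs dfs -du -h /user                          # per-directory disk usage
hdfs dfsadmin -setSpaceQuota 1t /user/alice    # allocate a space quota
hdfs balancer -threshold 10                    # rebalance data across the DataNodes
hdfs dfs -setrep -w 2 /user/alice/data         # lower the replication factor to reclaim space
```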
10. Data Protection, File Formats and Accessing HDFS:
- Safeguarding Data
- Using HDFS Trash to Prevent Accidental Data Deletion
- Using HDFS Snapshots to Protect Important Data
- Ensuring Data Integrity with File System Checks
- Data Compression
- Common Compression Formats
- Evaluating the Various Compression Schemes
- Compression at Various Stages for MapReduce
- Compression for Spark
- Data Serialization
- Hadoop File Formats
- Criteria for Determining the Right File Format
- File Formats Supported by Hadoop
- The Ideal File Format
- The Hadoop Small Files Problem and Merging Files
- Using a Federated NameNode to Overcome the Small Files Problem
- Using Hadoop Archives to Manage Many Small Files
- Handling the Performance Impact of Small Files
- Using Hadoop WebHDFS and HttpFS
- WebHDFS—The Hadoop REST API
- Using the WebHDFS API
- Understanding the WebHDFS Commands
- Using HttpFS Gateway to Access HDFS from Behind a Firewall
- Summary
11. NameNode Operations, High Availability and Federation:
- Understanding NameNode Operations
- HDFS Metadata
- The NameNode Startup Process
- How the NameNode and the DataNodes Work Together
- The Checkpointing Process
- Secondary, Checkpoint, Backup and Standby Nodes
- Configuring the Checkpointing Frequency
- Managing Checkpoint Performance
- The Mechanics of Checkpointing
- NameNode Safe Mode Operations
- Automatic Safe Mode Operations
- Placing the NameNode in Safe Mode
- How the NameNode Transitions Through Safe Mode
- Backing Up and Recovering the NameNode Metadata
- Configuring HDFS High Availability
- NameNode HA Architecture (QJM)
- Setting Up an HDFS HA Quorum Cluster
- Deploying the High-Availability NameNodes
- Managing an HA NameNode Setup
- HA Manual and Automatic Failover
- HDFS Federation
- Architecture of a Federated NameNode
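A brief sketch of the NameNode housekeeping commands behind Chapter 11; nn1 and nn2 are the NameNode IDs configured for HA, and the backup path is a placeholder:

```bash
# Safe mode and a manual metadata checkpoint (run as the HDFS superuser)
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave

# Pull a copy of the latest fsimage for backup
hdfs dfsadmin -fetchImage /backup/namenode

# Inspect and drive an HA NameNode pair
hdfs haadmin -getServiceState nn1
hdfs haadmin -failover nn1 nn2
```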
IV. Moving Data, Allocating Resources, Scheduling Jobs and Security:
12. Moving Data Into and Out of Hadoop:
- Introduction to Hadoop Data Transfer Tools
- Loading Data into HDFS from the Command Line
- Using the -cat Command to Dump a File’s Contents
- Testing HDFS Files
- Copying and Moving Files from and to HDFS
- Using the -get Command to Move Files
- Moving Files from and to HDFS
- Using the -tail and -head Commands
- Copying HDFS Data between Clusters with DistCp
- How to Use the DistCp Command to Move Data
- DistCp Options
- Ingesting Data from Relational Databases with Sqoop
- Sqoop Architecture
- Deploying Sqoop
- Using Sqoop to Move Data
- Importing Data with Sqoop
- Importing Data into Hive
- Exporting Data with Sqoop
- Ingesting Data from External Sources with Flume
- Flume Architecture in a Nutshell
- Configuring the Flume Agent
- A Simple Flume Example
- Using Flume to Move Data to HDFS
- A More Complex Flume Example
- Ingesting Data with Kafka
- Benefits Offered by Kafka
- How Kafka Works
- Setting Up an Apache Kafka Cluster
- Integrating Kafka with Hadoop and Storm
- Summary
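Two of the data-movement tools from Chapter 12 in sketch form; the cluster names, JDBC URL, credentials and table are placeholders:

```bash
# Copy data between clusters with DistCp
hadoop distcp hdfs://nn-prod:8020/data/events hdfs://nn-dr:8020/data/events

# Import a relational table into HDFS with Sqoop
sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username sqoop_user \
    --password-file /user/hadoop/.db_password \
    --table customers \
    --target-dir /data/sales/customers \
    --num-mappers 4
```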
13. Resource Allocation in a Hadoop Cluster:
- Resource Allocation in Hadoop
- Managing Cluster Workloads
- Hadoop’s Resource Schedulers
- The FIFO Scheduler
- The Capacity Scheduler
- Queues and Subqueues
- How the Cluster Allocates Resources
- Preempting Applications
- Enabling the Capacity Scheduler
- A Typical Capacity Scheduler
- The Fair Scheduler
- Queues
- Configuring the Fair Scheduler
- How Jobs Are Placed into Queues
- Application Preemption in the Fair Scheduler
- Security and Resource Pools
- A Sample fair-scheduler.xml File
- Submitting Jobs to the Scheduler
- Moving Applications between Queues
- Monitoring the Fair Scheduler
- Comparing the Capacity Scheduler and the Fair Scheduler
- Similarities between the Two Schedulers
- Differences between the Two Schedulers
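A minimal sketch of the kind of two-queue Capacity Scheduler setup Chapter 13 walks through; the queue names and percentages are illustrative, and it assumes the Capacity Scheduler is already the configured scheduler in yarn-site.xml:

```bash
cat > $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml <<'EOF'
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
</configuration>
EOF

# Have the ResourceManager pick up the queue changes without a restart
yarn rmadmin -refreshQueues
```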
14. Working with Oozie to Manage Job Workflows:
- Using Apache Oozie to Schedule Jobs
- Oozie Architecture
- The Oozie Server
- The Oozie Client
- The Oozie Database
- Deploying Oozie in Your Cluster
- Installing and Configuring Oozie
- Configuring Hadoop for Oozie
- Understanding Oozie Workflows
- Workflows, Control Flow, and Nodes
- Defining the Workflows with the workflow.xml File
- How Oozie Runs an Action
- Configuring the Action Nodes
- Creating an Oozie Workflow
- Configuring the Control Nodes
- Configuring the Job
- Running an Oozie Workflow Job
- Specifying the Job Properties
- Deploying Oozie Jobs
- Creating Dynamic Workflows
- Oozie Coordinators
- Time-Based Coordinators
- Data-Based Coordinators
- Time-and-Data-Based Coordinators
- Submitting the Oozie Coordinator from the Command Line
- Managing and Administering Oozie
- Common Oozie Commands and How to Run Them
- Troubleshooting Oozie
- Oozie cron Scheduling and Oozie Service Level Agreements
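A sketch of the Oozie client commands Chapter 14 relies on; the Oozie URL, the job.properties file and the workflow job ID are placeholders:

```bash
export OOZIE_URL=http://oozieserver:11000/oozie

oozie job -run -config job.properties                    # submit and start the workflow
oozie job -info 0000012-170101123456789-oozie-oozi-W     # check its status
oozie job -log  0000012-170101123456789-oozie-oozi-W     # fetch its log
oozie jobs -jobtype coordinator                          # list coordinator jobs
```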
15. Securing Hadoop:
- Hadoop Security—An Overview
- Authentication, Authorization and Accounting
- Hadoop Authentication with Kerberos
- Kerberos and How It Works
- The Kerberos Authentication Process
- Kerberos Trusts
- A Special Principal
- Adding Kerberos Authorization to your Cluster
- Setting Up Kerberos for Hadoop
- Securing a Hadoop Cluster with Kerberos
- How Kerberos Authenticates Users and Services
- Managing a Kerberized Hadoop Cluster
- Hadoop Authorization
- HDFS Permissions
- Service Level Authorization
- Role-Based Authorization with Apache Sentry
- Auditing Hadoop
- Auditing HDFS Operations
- Auditing YARN Operations
- Securing Hadoop Data
- HDFS Transparent Encryption
- Encrypting Data in Transition
- Other Hadoop-Related Security Initiatives
- Securing a Hadoop Infrastructure with Apache Knox Gateway
- Apache Ranger for Security Administration
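A sketch of the Kerberos legwork behind Chapter 15; the realm, host name, keytab path and user principal are placeholders:

```bash
# Create a service principal and keytab for a DataNode host
kadmin.local -q "addprinc -randkey hdfs/dn1.example.com@EXAMPLE.COM"
kadmin.local -q "xst -k /etc/security/keytabs/hdfs.keytab hdfs/dn1.example.com@EXAMPLE.COM"

# A user obtains a ticket before working against the kerberized cluster
kinit alice@EXAMPLE.COM
klist                       # confirm the ticket was granted
hdfs dfs -ls /user/alice    # now succeeds against the secured NameNode
```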
V. Monitoring, Optimization and Troubleshooting:
16. Managing Jobs, Using Hue and Performing Routine Tasks:
- Using the YARN Commands to Manage Hadoop Jobs
- Viewing YARN Applications
- Checking the Status of an Application
- Killing a Running Application
- Checking the Status of the Nodes
- Checking YARN Queues
- Getting the Application Logs
- YARN Administrative Commands
- Decommissioning and Recommissioning Nodes
- Including and Excluding Hosts
- Decommissioning DataNodes and NodeManagers
- Recommissioning Nodes
- Things to Remember about Decommissioning and Recommissioning
- Adding a New DataNode and/or a NodeManager
- ResourceManager High Availability
- ResourceManager High-Availability Architecture
- Setting Up ResourceManager High Availability
- ResourceManager Failover
- Using the ResourceManager High-Availability Commands
- Performing Common Management Tasks
- Moving the NameNode to a Different Host
- Managing High-Availability NameNodes
- Using a Shutdown/Startup Script to Manage your Cluster
- Balancing HDFS
- Balancing the Storage on the DataNodes
- Managing the MySQL Database
- Configuring a MySQL Database
- Configuring MySQL High Availability
- Backing Up Important Cluster Data
- Backing Up HDFS Metadata
- Backing Up the Metastore Databases
- Using Hue to Administer Your Cluster
- Allowing Your Users to Use Hue
- Installing Hue
- Configuring Your Cluster to Work with Hue
- Managing Hue
- Working with Hue
- Implementing Specialized HDFS Features
- Deploying HDFS and YARN in a Multihomed Network
- Short-Circuit Local Reads
- Mountable HDFS
- Using an NFS Gateway for Mounting HDFS to a Local File System
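A quick sketch of the YARN application and node-management commands Chapter 16 covers; the application ID is a placeholder:

```bash
# Inspect and manage running applications
yarn application -list
yarn application -status application_1488866535453_0001
yarn application -kill   application_1488866535453_0001
yarn logs -applicationId application_1488866535453_0001   # aggregated application logs

# Node management: edit the include/exclude files, then refresh
yarn node -list -all
yarn rmadmin -refreshNodes     # decommission or recommission NodeManagers
hdfs dfsadmin -refreshNodes    # the HDFS counterpart for DataNodes
```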
17. Monitoring, Metrics and Hadoop Logging:
- Monitoring Linux Servers
- Basics of Linux System Monitoring
- Monitoring Tools for Linux Systems
- Hadoop Metrics
- Hadoop Metric Types
- Using the Hadoop Metrics
- Capturing Metrics to a File System
- Using Ganglia for Monitoring
- Ganglia Architecture
- Setting Up the Ganglia and Hadoop Integration
- Setting Up the Hadoop Metrics
- Understanding Hadoop Logging
- Hadoop Log Messages
- Daemon and Application Logs and How to View Them
- How Application Logging Works
- How Hadoop Uses HDFS Staging Directories and Local Directories During a Job Run
- How the NodeManager Uses the Local Directories
- Storing Job Logs in HDFS through Log Aggregation
- Working with the Hadoop Daemon Logs
- Using Hadoop’s Web UIs for Monitoring
- Monitoring Jobs with the ResourceManager Web UI
- The JobHistoryServer Web UI
- Monitoring with the NameNode Web UI
- Monitoring Other Hadoop Components
- Monitoring Hive
- Monitoring Spark
18. Tuning the Cluster Resources, Optimizing MapReduce Jobs and Benchmarking:
- How to Allocate YARN Memory and CPU
- Allocating Memory
- Configuring the Number of CPU Cores
- Relationship between Memory and CPU Vcores
- Configuring Efficient Performance
- Speculative Execution
- Reducing the I/O Load on the System
- Tuning Map and Reduce Tasks—What the Administrator Can Do
- Tuning the Map Tasks
- Input and Output
- Tuning the Reduce Tasks
- Tuning the MapReduce Shuffle Process
- Optimizing Pig and Hive Jobs
- Optimizing Hive Jobs
- Optimizing Pig Jobs
- Benchmarking Your Cluster
- Using TestDFSIO for Testing I/O Performance
- Benchmarking with TeraSort
- Using Hadoop’s Rumen and GridMix for Benchmarking
- Hadoop Counters
- File System Counters
- Job Counters
- MapReduce Framework Counters
- Custom Java Counters
- Limiting the Number of Counters
- Optimizing MapReduce
- Map-Only versus Map and Reduce Jobs
- How Combiners Improve MapReduce Performance
- Using a Partitioner to Improve Performance
- Compressing Data During the MapReduce Process
- Too Many Mappers or Reducers?
19. Configuring and Tuning Apache Spark on YARN:
- Configuring Resource Allocation for Spark on YARN
- Allocating CPU
- Allocating Memory
- How Resources are Allocated to Spark
- Limits on the Resource Allocation to Spark Applications
- Allocating Resources to the Driver
- Configuring Resources for the Executors
- How Spark Uses its Memory
- Things to Remember
- Cluster or Client Mode?
- Configuring Spark-Related Network Parameters
- Dynamic Resource Allocation when Running Spark on YARN
- Dynamic and Static Resource Allocation
- How Spark Manages Dynamic Resource Allocation
- Enabling Dynamic Resource Allocation
- Storage Formats and Compressing Data
- Storage Formats
- File Sizes
- Compression
- Monitoring Spark Applications
- Using the Spark Web UI to Understand Performance
- Spark System and the Metrics REST API
- The Spark History Server on YARN
- Tracking Jobs from the Command Line
- Tuning Garbage Collection
- The Mechanics of Garbage Collection
- How to Collect GC Statistics
- Tuning Spark Streaming Applications
- Reducing Batch Processing Time
- Setting the Right Batch Interval
- Tuning Memory and Garbage Collection
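A sketch of the resource-allocation and dynamic-allocation settings Chapter 19 discusses, expressed as a spark-submit invocation; the values, application class and jar are illustrative, and the external shuffle service must also be configured on every NodeManager:

```bash
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --executor-cores 2 \
    --executor-memory 4g \
    --conf spark.yarn.executor.memoryOverhead=512 \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=2 \
    --conf spark.dynamicAllocation.maxExecutors=20 \
    --conf spark.shuffle.service.enabled=true \
    --class com.example.MyApp myapp.jar
```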
20. Optimizing Spark Applications:
- Revisiting the Spark Execution Model
- The Spark Execution Model
- Shuffle Operations and How to Minimize Them
- A WordCount Example to Our Rescue Again
- Impact of a Shuffle Operation
- Configuring the Shuffle Parameters
- Partitioning and Parallelism (Number of Tasks)
- Level of Parallelism
- Problems with Too Few Tasks
- Setting the Default Number of Partitions
- How to Increase the Number of Partitions
- Using the Repartition and Coalesce Operators to Change the Number of Partitions in an RDD
- Two Types of Partitioners
- Data Partitioning and How It Can Avoid a Shuffle
- Optimizing Data Serialization and Compression
- Data Serialization
- Configuring Compression
- Understanding Spark’s SQL Query Optimizer
- Understanding the Optimizer Steps
- Spark’s Speculative Execution Feature
- The Importance of Data Locality
- Caching Data
- Fault-Tolerance Due to Caching
- How to Specify Caching
21. Troubleshooting Hadoop—A Sampler:
- Space-Related Issues
- Dealing with a 100 Percent Full Linux File System
- HDFS Space Issues
- Local and Log Directories Out of Free Space
- Disk Volume Failure Toleration
- Handling YARN Jobs That Are Stuck
- JVM Memory-Allocation and Garbage-Collection Strategies
- Understanding JVM Garbage Collection
- Optimizing Garbage Collection
- Analyzing Memory Usage
- Out of Memory Errors
- ApplicationMaster Memory Issues
- Handling Different Types of Failures
- Handling Daemon Failures
- Starting Failures for Hadoop Daemons
- Task and Job Failures
- Troubleshooting Spark Jobs
- Spark’s Fault Tolerance Mechanism
- Killing Spark Jobs
- Maximum Attempts for a Job
- Maximum Failures per Job
- Debugging Spark Applications
- Viewing Logs with Log Aggregation
- Viewing Logs When Log Aggregation Is Not Enabled
- Reviewing the Launch Environment
22. Installing VirtualBox and Linux and Cloning the Virtual Machines:
- Installing Oracle VirtualBox
- Installing Oracle Enterprise Linux
- Cloning the Linux Server