Project 1 – Working with MapReduce, Hive, Sqoop
Topics:
This project involves working with various Hadoop components such as MapReduce, Apache Hive and Apache Sqoop. You will work with Sqoop to import data from a relational database management system such as MySQL into HDFS, deploy Hive for data summarization, querying and analysis, and convert SQL queries into HiveQL to run MapReduce jobs on the transferred data. You will gain considerable proficiency in Hive and Sqoop after completing this project.
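As a rough illustration of the Hive side of this flow, the sketch below uses the HiveServer2 JDBC driver to run a HiveQL aggregation over a table assumed to have been imported from MySQL with Sqoop; the connection URL, table name (orders) and column names are placeholders, not part of the course material:

// Minimal sketch. Assumptions: HiveServer2 on localhost:10000 and a Sqoop-imported
// table named "orders" with columns "customer_id" and "amount" (all hypothetical).
// A typical preceding import might look like:
//   sqoop import --connect jdbc:mysql://dbhost/shop --table orders --target-dir /user/hive/warehouse/orders
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveSummaryQuery {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {
      // Hive compiles the HiveQL below into MapReduce jobs on the cluster.
      ResultSet rs = stmt.executeQuery(
          "SELECT customer_id, SUM(amount) AS total "
          + "FROM orders GROUP BY customer_id ORDER BY total DESC LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
      }
    }
  }
}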
Project 2 – Work on MovieLens data for finding top records
Data – MovieLens dataset
Topics:
In this project you will work exclusively with the publicly available MovieLens ratings datasets. The project involves the following components:
· You will write a MapReduce program to find the top 10 movies by working on the data file (a sketch follows this list)
· Learn to deploy Apache Pig to create the top 10 movies list by loading the data
· Work with Apache Hive to create the top 10 movies list by loading the data
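A minimal MapReduce sketch for the first item, assuming the ratings arrive as MovieLens ml-1m style userId::movieId::rating::timestamp records; the class names (TopMoviesJob, RatingMapper, TopTenReducer) and the single-reducer top-10 approach are illustrative choices, not the course's reference solution:

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopMoviesJob {

  // Emit (movieId, 1) for every rating record.
  public static class RatingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text movieId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("::");
      if (fields.length >= 2) {          // skip malformed lines
        movieId.set(fields[1]);
        ctx.write(movieId, ONE);
      }
    }
  }

  // Sum the counts per movie, keep only the 10 largest, emit them in cleanup().
  public static class TopTenReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final TreeMap<Integer, String> topTen = new TreeMap<>();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx) {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      // Note: movies with identical counts overwrite each other; acceptable for a sketch.
      topTen.put(sum, key.toString());
      if (topTen.size() > 10) topTen.remove(topTen.firstKey());  // drop the smallest
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      for (Map.Entry<Integer, String> e : topTen.descendingMap().entrySet()) {
        ctx.write(new Text(e.getValue()), new IntWritable(e.getKey()));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "top-10-movies");
    job.setJarByClass(TopMoviesJob.class);
    job.setMapperClass(RatingMapper.class);
    job.setReducerClass(TopTenReducer.class);
    job.setNumReduceTasks(1);            // single reducer so one TreeMap sees all movies
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}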
Project 3 – Hadoop YARN Project – End to End PoC
Topics:
In this project you will work on a live Hadoop YARN project. YARN is the part of the Hadoop 2.0 ecosystem that decouples Hadoop from MapReduce and supports more competitive processing engines and a wider array of applications. You will work with the YARN central ResourceManager. The salient features of this project include:
· Importing movie data
· Appending the data
· Using Sqoop commands to bring the data into HDFS (see the sketch after this list)
· End-to-end flow of transaction data
· Processing the movie data using a MapReduce program, etc.
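One hedged way to drive the Sqoop import step from Java is the Sqoop 1.x runTool entry point (the same arguments are normally passed to the sqoop import command line); the connection string, credentials, table name and target directory below are placeholders:

// Hedged sketch: invoking a Sqoop 1.x import (with --append) from Java.
// All connection details and paths are placeholders.
import org.apache.sqoop.Sqoop;

public class MovieDataImport {
  public static void main(String[] args) {
    String[] sqoopArgs = {
        "import",
        "--connect", "jdbc:mysql://dbhost:3306/movies",   // hypothetical source database
        "--username", "etl", "--password-file", "/user/etl/.pw",
        "--table", "movie_transactions",                  // hypothetical table
        "--target-dir", "/data/movies/raw",
        "--append"                                        // append new rows to existing HDFS data
    };
    int exitCode = Sqoop.runTool(sqoopArgs);              // equivalent to running `sqoop import ...`
    System.exit(exitCode);
  }
}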
Project 4 – Partitioning Tables in Hive
Topics:
This project involves working with Hive table data partitioning. The right partitioning scheme helps you read the data, deploy it on HDFS and run MapReduce jobs at a much faster rate. Hive lets you partition data in multiple ways:
· Manual Partitioning
· Dynamic Partitioning
· Bucketing
This will give you hands-on experience in partitioning Hive tables manually, loading many partitions with a single SQL statement through dynamic partitioning, and bucketing data so as to break it into manageable chunks.
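The sketch below, again over the HiveServer2 JDBC driver, illustrates the three approaches in HiveQL; the table and column names (ratings_raw, ratings_part, ratings_bucketed, rated_at) are hypothetical:

// Minimal sketch of Hive partitioning and bucketing via HiveServer2 JDBC.
// Table and column names are placeholders for illustration only.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HivePartitioningDemo {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // Partitioned table with one partition column (year of the rating).
      stmt.execute("CREATE TABLE IF NOT EXISTS ratings_part "
          + "(userid INT, movieid INT, rating INT) PARTITIONED BY (yr INT)");

      // Manual (static) partitioning: the partition value is spelled out.
      stmt.execute("ALTER TABLE ratings_part ADD IF NOT EXISTS PARTITION (yr=2003)");

      // Dynamic partitioning: a single INSERT spreads rows across partitions.
      stmt.execute("SET hive.exec.dynamic.partition=true");
      stmt.execute("SET hive.exec.dynamic.partition.mode=nonstrict");
      stmt.execute("INSERT OVERWRITE TABLE ratings_part PARTITION (yr) "
          + "SELECT userid, movieid, rating, year(rated_at) AS yr FROM ratings_raw");

      // Bucketing: hash movieid into a fixed number of manageable files.
      stmt.execute("CREATE TABLE IF NOT EXISTS ratings_bucketed "
          + "(userid INT, movieid INT, rating INT) "
          + "CLUSTERED BY (movieid) INTO 8 BUCKETS");
    }
  }
}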
Project 5 – Connecting Pentaho with Hadoop Ecosystem
Topics:
This project lets you connect Pentaho with the Hadoop ecosystem. Pentaho works well with HDFS, HBase, Oozie and ZooKeeper. You will connect the Hadoop cluster with Pentaho Data Integration, Analytics, Pentaho Server and Report Designer. The components of this project include the following:
· Clear hands-on working knowledge of ETL and Business Intelligence
· Configuring Pentaho to work with the Hadoop distribution
· Extracting, transforming and loading data into the Hadoop cluster
Project 6 – Multi-node Cluster Setup
Topics:
This project gives you the opportunity to work on a real-world Hadoop multi-node cluster setup in a distributed environment. The major components of this project involve:
· Running a Hadoop multi-node cluster on a 4-node setup on Amazon EC2
· Deploying a MapReduce job on the Hadoop cluster
You will get a complete demonstration of working with the various master and slave nodes of a Hadoop cluster, installing Java as a prerequisite for running Hadoop, installing Hadoop itself and mapping the nodes in the cluster.
· Hadoop multi-node cluster setup using Amazon EC2 – creating a 4-node cluster
· Running MapReduce jobs on the cluster (a verification sketch follows this list)
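Once the 4-node cluster is running, one hedged way to verify that every slave has registered with the master is to query HDFS for its live DataNodes; the fs.defaultFS address below is a placeholder for the EC2 master node:

// Hedged sketch: list the DataNodes of a freshly set-up multi-node cluster.
// The NameNode address is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ClusterCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://ec2-master:9000");    // hypothetical master node
    try (FileSystem fs = FileSystem.get(conf)) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      for (DatanodeInfo node : dfs.getDataNodeStats()) {   // one entry per registered DataNode
        System.out.println(node.getHostName() + "\t" + node.getRemaining() + " bytes free");
      }
    }
  }
}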
Project 7 – Hadoop Testing using MRUnit
Topics:
In this project you will gain proficiency in testing Hadoop MapReduce code using MRUnit. You will learn about real-world scenarios for deploying MRUnit, Mockito and PowerMock. Important aspects of this project include:
· Writing JUnit tests using MRUnit for MapReduce applications
· Mocking static methods using PowerMock and Mockito
· Using MapReduceDriver to test a map and reduce pair together
After completing this project you will be well versed in test-driven development and will be able to write lightweight test units that work specifically on the Hadoop architecture.
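A minimal MRUnit sketch of the first item, testing a mapper such as the hypothetical RatingMapper from the Project 2 sketch with the MRUnit 1.x MapDriver API; ReduceDriver and MapReduceDriver follow the same withInput/withOutput/runTest pattern:

// Minimal MRUnit 1.x sketch: a JUnit test for the RatingMapper sketched earlier.
// Class names are illustrative.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class RatingMapperTest {
  private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    // MapDriver runs the mapper in-memory, with no cluster or HDFS required.
    mapDriver = MapDriver.newMapDriver(new TopMoviesJob.RatingMapper());
  }

  @Test
  public void emitsMovieIdWithCountOne() throws Exception {
    mapDriver.withInput(new LongWritable(0), new Text("1::1193::5::978300760"))
             .withOutput(new Text("1193"), new IntWritable(1))
             .runTest();
  }
}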
Project 8 – Hadoop Weblog Analytics
Data – Weblogs
Topics:
This project involves making sense of web log data in order to derive valuable insights from it. You will load the server data onto a Hadoop cluster using various techniques. The modules of this project include:
· Aggregation of log data
· Processing of the data and generating analytics
The web log data can include the URLs visited, cookie data, user demographics, location, and the date and time of web service access, among other fields. In this project you will transport the data using Apache Flume or Kafka, and handle workflow and data cleansing using MapReduce, Pig or Spark. The insights derived can be used to analyze customer behavior and predict buying patterns.
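As one possible shape for the processing module, here is a hedged MapReduce mapper sketch that extracts the requested URL from Apache access-log lines and emits it for per-URL hit counting; the log format and the regular expression are assumptions about the dataset rather than part of the course material:

// Hedged sketch: mapper that extracts the requested URL from a common/combined
// log format line ("... \"GET /path HTTP/1.1\" ...") and emits (url, 1).
// A standard summing reducer (e.g. IntSumReducer) would then aggregate hits per URL.
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WeblogUrlMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final Pattern REQUEST =
      Pattern.compile("\"[A-Z]+ (\\S+) HTTP/[0-9.]+\"");
  private static final IntWritable ONE = new IntWritable(1);
  private final Text url = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    Matcher m = REQUEST.matcher(value.toString());
    if (m.find()) {            // skip malformed lines instead of failing the job
      url.set(m.group(1));
      ctx.write(url, ONE);
    }
  }
}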
Project 9 – Hadoop Maintenance
Topics:
This project involves maintaining and managing a Hadoop cluster. You will work on a number of important tasks, such as:
· Administration of the distributed file system
· Checking the file system
· Working with the name node directory structure
· Audit logging, the data node block scanner and the balancer
· Learning about the properties of safe mode
· Entering and exiting safe mode (see the sketch after this list)
· HDFS federation and high availability
· Failover, fencing, DistCp and Hadoop file formats
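For the safe mode items, the usual admin commands are hdfs dfsadmin -safemode get|enter|leave; the sketch below shows the equivalent client calls through DistributedFileSystem, with the NameNode address as a placeholder:

// Hedged sketch: checking, entering and leaving HDFS safe mode programmatically
// (equivalent to `hdfs dfsadmin -safemode get|enter|leave`). Address is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafeModeToggle {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000");        // hypothetical NameNode
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    boolean inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);  // query only
    System.out.println("In safe mode: " + inSafeMode);

    dfs.setSafeMode(SafeModeAction.SAFEMODE_ENTER);          // no writes accepted now
    dfs.setSafeMode(SafeModeAction.SAFEMODE_LEAVE);          // back to normal operation
    dfs.close();
  }
}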