What do you do for cluster management?
At midnight, you get a call saying there is not enough space, i.e., the HDFS threshold has been reached. What is your approach to resolving this issue?
How many clusters and nodes are present in your project?
How will you know about the threshold? Do you check manually every time? Do you know about tools such as Puppet?
Code was tested successfully in Dev and Test. When deployed to Production, it is failing. As an admin, how do you track down the issue?
If the NameNode is down, the whole cluster is down. What is the approach to bring it back?
What is decommissioning?
You have decommissioned a node; can you add it back to the cluster again? What happens to the data present on the DataNode when it is decommissioned?
Activities performed on Cloudera Manager.
How do you start and stop NameNode services?
What were the biggest challenges you went through in your project?
How do you install Cloudera and the NameNode?
Background of your current project.
If a DataNode is down, what is the solution? (situation-based question)
More questions can be expected on Linux and Hadoop administration.
Daily activities.
Versions.
What is decommissioning?
What is the procedure to decommission a DataNode?
Difference between MR1 and MR2.
Difference between Hadoop 1 and Hadoop 2.
Difference between RDBMS and NoSQL.
What is the use of Nagios?
How do you set up passwordless SSH in Hadoop?
Upgrades (have you done any)?
Cloudera Manager port number.
What is your cluster size?
Versions.
MapReduce version.
Daily activities.
What operations do you normally use in Cloudera Manager?
Is the internet connected to your nodes?
Do you have different Cloudera Manager instances for dev and production?
What are the installation steps?
Initial Screening by Vendor for VISA Client Dt: 5th-Oct-2015
1) What are your day-to-day activities?
2) How do you add a DataNode to the cluster?
3) Do you have any idea about dfs.name.dir?
4) What will happen when a DataNode is down?
5) How will you test whether a DataNode is working or not?
6) Do you have any idea about zombie processes?
7) How will the NameNode know that a DataNode is down?
(Nagios alerts, the dfsadmin -report command, Cloudera Manager)
8) Heartbeat: is it sequential processing or parallel processing?
9) What is the volume of data you receive into the cluster?
(40 to 50 GB)
10) How do you receive data into your cluster?
11) What is your cluster size?
12) What is the port number of the NameNode?
13) What is the port number of the JobTracker?
14) How do you install Hive, Pig, and HBase?
15) What is JVM?
16) How do you do rebalancing?
Hexaware Interview Dt: 10-Oct-2015 (41 Mins)
1) Tell me your day-to-day activities.
2) When adding a DataNode, do you bring down the cluster?
3) What are the ecosystem components you have on your cluster?
4) Have you been involved in cluster planning?
5) Who takes the decision to add a new DataNode?
6) Have you been involved in planning for adding DataNodes?
7) How do you do upgrades? Is there any maintenance window?
8) When you are adding a DataNode, what is the impact of new blocks created by running jobs?
9) Do you have any idea about checkpointing?
10) For checkpointing, does the admin need to do any activity, or is it automatically taken care of by Cloudera?
11) Do you know about Ambari? Have you ever worked on Ambari or Hortonworks?
12) Do developers use MapReduce programming on the cluster you are working on?
13) Do you know what type of data is coming from different systems to your cluster and what type of analysis is done on it?
14) Do you have Scala and Storm in your application?
15) Do you use the Oozie scheduler in the project?
16) What type of Unix scripting is done?
17) Is your cluster on any cloud?
18) When you are adding any DataNode, do you do anything with the configuration files?
19) How much experience do you have with Linux and scripting? How is your comfort level?
20) Do you have any idea about data warehouses?
21) Have you worked on data visualization?
22) Who takes care of copying data from Unix to HDFS? Is there any automation?
23) It looks like you joined a project that was already configured. Do you have hands-on experience configuring a cluster from scratch?
24) Have you ever seen the hardware of the nodes in the cluster? What is the configuration?
25) Have you used Sqoop to pull data from different databases?
26) What is your cluster size?
DishNET Interview Dt: 15-Oct-2015 (30 Mins)
1) Tell me about yourself.
2) What is meant by high availability?
3) Does HA happen automatically?
4) What are the services used for HA?
5) What are the benefits of YARN compared to Hadoop 1?
6) Have you integrated MapReduce to run Hive?
7) Do you have experience with HBase?
8) Could you explain the process of integrating Tableau?
9) What is the process of upgrading a DataNode?
10) What are the schedulers used in Hadoop?
11) How do you do load balancing?
12) When you add a DataNode to the cluster, how will data be copied to the new DataNode?
13) How can you remove 5 DataNodes from the cluster? Can you do it all at the same time?
14) How do you give authorization to users?
15) How do you give permissions to a file, such as write access to one group and read access to another group?
16) How do you authenticate users to Hive tables?
17) How do you give users LDAP access to Hive?
18) Do you know about Kerberos?
19) Have you done a CDH upgrade?
20) Do you need to bring the cluster down for a CDH upgrade?
21) Have you worked on performance tuning of Hive queries?
22) What types of performance tuning have you done?
23) Do you have any idea about Impala?
24) Do you know how Hadoop supports real-time activity?
25) How do you allocate resource pools?
26) How do you maintain data on multiple disks on a DataNode?
27) Will there be any performance issue if data is on different disks on a DataNode?
TCS Interview Dt: 18-Oct-2015 (25Mins)
1) Hi, where are you located? Are you fine with relocating to CA?
2) How much experience do you have in the big data area?
3) Could you give me your day-to-day activities?
4) What is the process to upgrade Hive?
5) What is the way to decommission multiple DataNodes?
6) Have you used the rsync command?
7) How do you decommission a DataNode?
8) What is the process to integrate the metastore for Hive? Could you explain the process?
9) Do you have experience with scripting? If yes, is it Unix shell or Python?
10) Have you worked on Puppet?
11) Have you worked on other distributions like Hortonworks?
12) How do you delete files which are older than 7 days?
13) What is the way to delete tmp files from nodes? If there are 100 nodes, do you do it manually?
14) Have you been involved in a migration from CDH1 to CDH2?
15) If there is 20 TB of data in CDH1, what is the way to move it to CDH2?
16) Have you worked on HBase?
17) Do you know about Nagios and Ganglia? How are the graphs used?
18) In Nagios, what are the different options (conditions) to generate alerts?
19) Have you worked on Kerberos?
20) What is the command for balancing the DataNodes?
Impetus Interview Dt: 21-Oct-2015
1) What are your day-to-day activities?
2) What is the difference between the root user and a normal user?
3) Is your cluster on the cloud? Do you have any idea about the cloud?
4) Are your racks present in any data center?
5) What Hadoop version are you using?
6) What is the process to add a node to the cluster? Do you have any standard process? Do you see the physical servers?
7) What do you do for Tableau installation and integration?
8) What schedulers are you using in your project?
9) What is your cluster size?
10) What issues have you faced in your project? Do you log in frequently?
11) How are jobs handled? Do developers take care of them, or are you involved?
12) Have you worked on Sqoop and Oozie?
13) What are the ecosystem components you have worked on?
14) Do you know about Sentry?
15) It looks like you have worked on Cloudera Manager. What is your comfort level with manual installation and Hortonworks?
16) Have you done any scripting?
Wipro (15 Mins) Dt: 28-Oct-2015
1) What is your experience in the big data space?
2) What are your day-to-day activities?
3) The responsibilities you describe should be automated by now; what is your actual work in them?
4) Have you seen a situation where a MapReduce program that used to execute properly is no longer performing well? What is your approach to resolving the issue?
5) Have you come across an issue where sort and shuffle was causing a problem in a MapReduce program?
6) Have you worked on Kafka?
7) What are the reporting tools you are using?
8) Any experience with Spark?
9) What are the challenges you faced?
10) I will inform the employer; he will notify you of the next steps.
EMC Interview (45 Mins) Dt: 04-Nov-2015
01) Could you explain your big data experience?
02) Could you explain your environment? How many clusters?
03) What is the size of your cluster?
04) How is data loaded into Hive?
05) What is the configuration of the nodes?
06) What do you do for MapReduce performance tuning?
07) What are the parameters and values used for tuning?
08) What will happen when you change those values?
09) What else is used for tuning, other than the reducer?
10) Which components are there between the mapper and the reducer?
11) What are the steps to install Kerberos?
12) How do you integrate Kerberos with Tableau?
13) Do you have any idea about Sqoop and Flume?
14) What types of files come into your application?
15) Have you worked on unstructured files?
16) What type of tables are you using in Hive, internal or external?
17) Do you have any idea about Hue?
18) Where is Hue installed?
19) How do you give access to Hue, and how is Kerberos integrated with Hue?
20) Do you have any idea about Spark and Splunk?
21) Could you explain the Unix scripts you have developed so far?
22) What are the routine Unix commands you use?
23) How do you check I/O operations in Unix?
24) What is the size of the container in your project?
25) What is the architecture of your project? How does the data come in?
26) Do you have experience with Teradata?
27) What is the difference between Teradata and Oracle?
28) What are the utilities used in Teradata?
CTS Interview (Raja) Dt: 04-Nov-2015
1) How will you give access to Hue?
2) What is rebalancing?
3) What will be needed from the user for Kerberos?
4) Java heap issues.
5) Explain Sqoop.
6) Explain Oozie.
7) Where will log files be stored?
(tar -cvf)
8) What are the HBase Master and RegionServer?
9) What is an edge node?
10) Explain YARN.
11) High availability.
12) What is the responsibility of ZooKeeper?
13) What needs to be done in order to run the standby node?
14) Decommissioning of a DataNode.
15) Cluster details.
16) Scalability.
17) How will you check that the upgrade was successful?
18) Schedulers.
19) What steps will you perform when a process has failed?
20) Recent issues you have faced.
21) What are the recent issues you faced?
22) Shell scripting.
23) What precautions will you take in order to avoid a single point of failure?
24) What is your backup plan?
25) How will you upgrade Cloudera Manager from 5.3 to 5.4?
Infosys (Second Round) (Hari) Dt: 04-Jan-2016
1. What distribution do you use, and how did you upgrade from 5.3 to 5.4?
2. Are you upgrading node by node? How?
3. How do you copy config files to the other nodes?
4. What security system do you follow? What is the difference without Kerberos?
5. What are JournalNodes (JN) and HA?
6. What is the usage of the Secondary NameNode (SNN)?
7. Usage of automatic failover: how do you do it? What are all the other methods?
8. How do you load data from Teradata to Hadoop?
9. Are you using Impala?
10. What is your cluster size?
11. How do you install Cloudera Manager?
12. What is iQuery?
13. You already have dev experience, so expect questions on development.
14. What Unix are you using, and how do you find the full OS details?
ZNA Interview (Anusha) Dt: 04-Jan-2016
1) Roles and responsibilities in your current project.
2) What do you monitor in the cluster, i.e., what do you monitor to ensure that the cluster is in a healthy state?
3) Are you involved in the planning and implementation of the Hadoop cluster? What are the components that you need to keep in mind while planning a Hadoop cluster?
4) You are given 10 empty boxes with 256 GB RAM and good hardware; how will you plan your cluster with these 10 boxes when there is 100 GB of data to come per day? (Steps right from the beginning, i.e., choosing the OS, choosing the software to be installed on the empty boxes, and the installation steps for Red Hat Linux.)
5) Steps to install Cloudera Hadoop.
6) What is JVM?
7) What is rack awareness?
8) What is Kerberos security? How will you install and enable it using the CLI, and how do you integrate it with Cloudera Manager?
9) What is high availability? How do you implement high availability on a pre-existing cluster with a single node? What are the requirements to implement HA?
10) What is Hive? How do you install and configure it from the CLI?
11) What are disk space and disk quota?
12) How do you add DataNodes to your cluster without using Cloudera Manager?
13) How do you add disk space to a DataNode that is already part of the cluster? And how do you format the disk before adding it to the cluster?
14) How good are you at shell scripting? Have you used shell scripting to automate any of your activities?
What are the activities that are automated using shell scripting in your current project?
15) What are the benefits of YARN compared to Hadoop 1?
17) Difference between MR1 and MR2?
18) What were the biggest challenges you went through in your project?
19) Activities performed on Cloudera Manager.
20) How will you know about the threshold? Do you check manually every time? Do you know about tools such as Puppet?
21) How many clusters and nodes are present in your project?
22) You got a call when you were out of the office saying there is not enough space, i.e., the HDFS threshold has been reached. What is your approach to resolving this issue?
23) Heartbeat messages: are they sequential processing or parallel processing?
24) What is the volume of data you receive into your cluster every day?
25) What is HDFS?
26) How do you implement SSH, SCP, and SFTP in Linux?
27) What are the services used for HA?
28) Do you have experience with HBase?
29) Does HA happen automatically?
Accenture (Anasha) Dt: 06-Jan-2016
1) What are your daily activities? What are your roles and responsibilities in your current project? What are the services that are implemented in your current project?
2) What have you done for performance tuning?
3) What is the block size in your project?
4) Explain your current project's process.
5) Have you used Storm, Kafka, or Solr services in your project?
6) Have you used the Puppet tool?
7) Have you used security in your project? Why do you use security in your cluster?
8) Explain how Kerberos authentication happens.
9) What is your cluster size, and what are the services you are using?
10) Do you have good hands-on experience in Linux?
11) Have you used Flume or Storm in your project?
12) Explain the purpose of HBase in your cluster.
Fidelity (Hari) Dt: 07-Jan-2016
1. What security authentication are you using? How are you managing it?
2. About Sentry and security authentication?
3. How do you schedule the jobs in the Fair Scheduler?
4. Prioritizing jobs.
5. How are you doing access control for HDFS?
6. Disaster recovery activities.
7. What issues have you faced so far?
8. Do you know about Puppet?
9. Hadoop development activities.
Barclays [04th Aug]
a) Describe yourself.
b) How did you move to Hadoop?
c) Your previous organization and your previous and current Hadoop projects.
Technical
1) Daemons in Hadoop.
2) Safe mode in HDFS.
3) distcp.
4) Reformatting the NameNode.
5) After adding new DataNodes to the cluster, what is required?
6) MR jobs are taking longer than usual; steps to improve performance.
7) Copying data from Hadoop to some other DB; the different ways.
8) Fetching a pattern from logs.
9) Cases with 0 mappers / 0 reducers.
10) Identity mapper and identity reducer.
11) How blocks are replicated in Hadoop.
12) What are the default ports 50030 and 50070 for?
13) core-site.xml / hdfs-site.xml / mapred-site.xml.
14) Pros and cons of the distributed cache.
MR
1) Client facing: if a production issue occurs because of your junior, how do you handle it?
2) Deadlines: you know that you cannot complete the task within the given deadlines.
3) The client's solution is not feasible, and he forces you to go with it.
4) You need help from another DB team, but your manager refuses to take help from them.
5) Support-activity related.
1) What is Hadoop?
Hadoop is a distributed computing platform written in Java. It provides features such as a distributed file system (HDFS) and MapReduce processing.
2) What platform and Java version is required to run Hadoop?
Java 1.6.x or a higher version is required for Hadoop, preferably from Sun/Oracle. Linux and Windows are the supported operating systems for Hadoop, but BSD, Mac OS X, and Solaris are also known to work.
3) What kind of Hardware is best for Hadoop?
Hadoop can run on dual-processor/dual-core machines with 4-8 GB of ECC RAM. The exact hardware depends on the workflow needs.
4) What are the most common input formats defined in Hadoop?
These are the most common input formats defined in Hadoop:
TextInputFormat
KeyValueInputFormat
SequenceFileInputFormat
TextInputFormat is the default input format; a driver snippet showing how the input format is selected follows.
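As an illustration only (not part of the original notes), a small driver helper showing how an input format is chosen with the newer org.apache.hadoop.mapreduce API; if setInputFormatClass() is never called, TextInputFormat is what the job uses:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatExample {
    // Hypothetical helper: configures the input format on an existing Job.
    public static void configure(Job job) {
        // Default behaviour: byte offset of the line as key, line text as value.
        job.setInputFormatClass(TextInputFormat.class);
        // Alternative: split each line into key<TAB>value pairs.
        // job.setInputFormatClass(KeyValueTextInputFormat.class);
        // Alternative: read binary key/value SequenceFiles.
        // job.setInputFormatClass(SequenceFileInputFormat.class);
    }
}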
5) What is an input block (input split) in Hadoop? Explain.
When a Hadoop job runs, it splits the input files into chunks and assigns each split to a mapper for processing. This unit is called an input split (input block).
6) How many input blocks are made by the Hadoop framework?
With the default block size of 64 MB, Hadoop will make 5 splits for the following files:
One split for a 64 KB file,
Two splits for a 65 MB file, and
Two splits for a 127 MB file.
The block size is configurable.
7) What is the use of the RecordReader in Hadoop?
An input split defines a unit of work but does not know how to access the data. The RecordReader class is responsible for loading the data from its source and converting it into key/value pairs suitable for reading by the mapper. The RecordReader instance is defined by the input format.
8) What is the JobTracker in Hadoop?
The JobTracker is a service within Hadoop that monitors and assigns Map tasks and Reduce tasks to the corresponding TaskTrackers on the DataNodes.
9) What are the functionalities of the JobTracker?
These are the main tasks of the JobTracker:
To accept jobs from clients.
To communicate with the NameNode to determine the location of the data.
To locate TaskTracker nodes with available slots.
To submit the work to the chosen TaskTracker nodes and monitor the progress of each task.
10) Define TaskTracker.
A TaskTracker is a node in the cluster that accepts tasks, such as Map, Reduce, and Shuffle operations, from a JobTracker.
11) What is a MapReduce job in Hadoop?
MapReduce is a programming paradigm that allows massive scalability across thousands of servers.
MapReduce actually refers to two distinct tasks that Hadoop performs. In the first step, the Map job takes a set of data and converts it into another set of data; in the second step, the Reduce job takes the output of the map as its input and combines those data tuples into a smaller set of tuples. A minimal mapper/reducer sketch is shown below.
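For illustration only (a classic word-count sketch, not taken from the original notes), the two phases look roughly like this in the newer Java API; the driver that wires these classes together is sketched under question 51 below:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: turn each input line into (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: collapse all (word, 1) pairs for one word into (word, total).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}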
12) What is Hadoop Streaming?
Hadoop Streaming is a utility that allows you to create and run MapReduce jobs with any executable or script. It is a generic API that allows programs written in any language to be used as Hadoop mappers and reducers.
13) What is a combiner in Hadoop?
A combiner is a mini-reduce process that operates only on data generated by a mapper. When the mapper emits data, the combiner receives it as input and sends its output to the reducer; a registration snippet follows.
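A hedged registration snippet (it reuses the word-count classes sketched under question 11; the helper name is hypothetical). The reducer can double as the combiner here because summing partial counts is associative and commutative:

import org.apache.hadoop.mapreduce.Job;

public class CombinerExample {
    public static void configure(Job job) {
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // runs map-side on each mapper's output
        job.setReducerClass(WordCountReducer.class);  // final cluster-wide aggregation
    }
}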
14) Is it necessary to know Java to learn Hadoop?
A background in any programming language such as C, C++, PHP, Python, or Java is helpful, but if you know no Java at all, it is necessary to learn Java and also to get basic knowledge of SQL.
15) How do you debug Hadoop code?
There are many ways to debug Hadoop code, but the most popular methods are:
Using counters (a sketch follows this list).
Using the web interface provided by the Hadoop framework.
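A hedged example (not from the notes) of the counter approach: a mapper increments a custom counter for malformed records, and the totals appear in the job's console output and in the web UI:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class ValidatingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 3) {
            // Counters are aggregated across all tasks and reported with the job status.
            context.getCounter("Debug", "MalformedRecords").increment(1);
            return; // skip the bad record
        }
        context.write(new Text(fields[0]), value);
    }
}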
16) Is it possible to provide multiple inputs to Hadoop? If yes, explain.
Yes, it is possible. The input format classes provide methods to add multiple directories as input to a Hadoop job; see the snippet below.
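Two illustrative ways to do this in a driver (the paths are placeholders, and WordCountMapper is the sketch from question 11): FileInputFormat accepts repeated addInputPath() calls, and MultipleInputs can even attach a different mapper per path:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultipleInputExample {
    public static void configure(Job job) throws IOException {
        // Option 1: several directories, all read with the same mapper.
        FileInputFormat.addInputPath(job, new Path("/data/2015/oct"));
        FileInputFormat.addInputPath(job, new Path("/data/2015/nov"));

        // Option 2: a dedicated input format and mapper for one particular path.
        MultipleInputs.addInputPath(job, new Path("/data/logs"),
                TextInputFormat.class, WordCountMapper.class);
    }
}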
17) What is the relation between a job and a task in Hadoop?
In Hadoop, a job is divided into multiple small parts known as tasks.
18) What is the distributed cache in Hadoop?
The distributed cache is a facility provided by the MapReduce framework to cache files (text files, archives, etc.) at the time of job execution. The framework copies the necessary files to each slave node before any task is executed on that node; a short sketch follows.
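A hedged sketch (the file name is hypothetical) of the newer Job-level cache API: the driver registers the file, and every task can read its local copy in setup() before map() runs:

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class DistributedCacheExample {
    // Driver side: ask the framework to ship the lookup file to every node.
    public static void configure(Job job) throws URISyntaxException {
        job.addCacheFile(new URI("/shared/lookup.txt#lookup"));
    }

    // Task side: the cached file is available locally before map() is ever called.
    static class CacheAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            URI[] cached = context.getCacheFiles(); // local copies, including "lookup"
            // ... load the lookup data from the symlinked file here ...
        }
    }
}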
19) What commands are used to see all jobs running in the Hadoop cluster and to kill a job in Linux?
hadoop job -list
hadoop job -kill <jobID>
20) What is the functionality of the JobTracker in Hadoop? How many instances of a JobTracker run on a Hadoop cluster?
The JobTracker is the master service used to submit and track MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster, and it runs within its own JVM process.
Functionalities of JobTracker in Hadoop:
When a client application submits a job to the JobTracker, the JobTracker talks to the NameNode to find the location of the data.
It locates TaskTracker nodes with available slots for data.
It assigns the work to the chosen TaskTracker nodes.
The TaskTracker nodes are responsible for notifying the JobTracker when a task fails; the JobTracker then decides what to do next. It may resubmit the task on another node, or it may mark that task to be avoided.
21) How does the JobTracker assign tasks to the TaskTrackers?
Each TaskTracker periodically sends heartbeat messages to the JobTracker to assure it that it is alive. These messages also inform the JobTracker about the number of available slots, so the JobTracker knows where tasks can be scheduled.
22) Is it necessary to write jobs for Hadoop in Java language?
No; there are many ways to deal with non-Java code. Hadoop Streaming allows any shell command to be used as a map or reduce function.
Hive Interview Questions and Answers
23) What is Apache Hive?
Apache Hive is data warehouse software used to facilitate managing and querying large data sets stored in distributed storage. Hive also permits traditional MapReduce programmers to plug in custom mappers and reducers when it is inefficient to express the logic in HiveQL.
24) How does Facebook use Hadoop, Hive, and HBase?
Facebook data is stored on HDFS; numerous photos are uploaded daily, and Facebook Messages, Likes, and status updates run on top of HBase. Hive generates reports for third-party developers and advertisers who need to measure the success of their campaigns or applications.
25) What is the difference between HBase and Hive?
Apache Hive is a data warehouse infrastructure built on top of Hadoop. Hive permits querying data stored on HDFS for analysis via HQL, an SQL-like language that gets converted into MapReduce jobs. Despite providing SQL functionality, Hive does not provide interactive querying; it only executes batch processes on Apache Hadoop.
Apache HBase is a NoSQL key/value store that runs on top of the Hadoop file system. HBase operations run in real time on its database rather than as MapReduce jobs. HBase is divided into tables, and tables are further divided into column families. Column families must be declared in the schema.
26) What is Hive Metastore?
The Hive Metastore is a database that stores metadata about your Hive tables, including the table name, data types, column names, table location, number of buckets, etc.
27) Which Hadoop versions are supported by the new Hive version?
The latest version of Hive (at the time these notes were written) is 2.0.
28) Which companies are mostly using Hive?
Facebook and Netflix
29) Whichever directory I run a Hive query from, it creates a new metastore_db; please explain the reason for this.
Whenever you execute Apache Hive in embedded mode, it creates a local metastore in the current directory. Before creating the metastore, Hive checks whether a metastore already exists or not. This behaviour is controlled in the configuration file hive-site.xml by the property "javax.jdo.option.ConnectionURL", whose default value is "jdbc:derby:;databaseName=metastore_db;create=true".
30) Is it possible for multiple users to use the same metastore, in the case of embedded Hive?
No, the embedded metastore cannot be used in shared mode. It is suggested to use a stand-alone "real" database such as MySQL or PostgreSQL.
31) What is the usage of the Query Processor in Apache Hive?
The Query Processor implements the processing framework for translating SQL into a graph of MapReduce jobs.
32) Are multi-line comments supported in Hive scripts?
No.
33) What is a Hive Metastore?
It is a central repository that stores Hive metadata in an external database.
34) Explain the SMB join in Hive.
In a Sort Merge Bucket (SMB) join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table, and then a merge-sort join is performed. SMB join is mainly used because it places no limit on partition, file, or table size for the join, and it is best used when the tables are very large. In an SMB join the tables are sorted and bucketed on the join columns, and all tables must have the same number of buckets.
35) Explain the different types of joins in Hive.
HiveQL has 4 different types of joins:
JOIN: similar to an inner join in SQL.
FULL OUTER JOIN: combines the records of both the left and right outer tables that fulfil the join condition.
RIGHT OUTER JOIN: all rows from the right table are returned even when there is no match in the left table.
LEFT OUTER JOIN: all rows from the left table are returned even when there is no match in the right table.
36) What is the usage of ObjectInspector?
ObjectInspector is used to analyze the internal structure of row objects and the structure of individual columns. ObjectInspector in Hive allows access to complex objects that can be stored in various formats.
37) Is it possible to change the default location of Managed Tables in Hive, if so how?
Yes, you can alter the default location of managed tables by utilizing the LOCATION keyword while creating the managed table. The user has to specify the path of the managed table as the value to the LOCATION keyword.
38) How can you connect an application, if you run Hive as a server?
When you run Hive as a server, an application can connect in one of three ways (a JDBC sketch follows this list):
ODBC Driver: supports the ODBC protocol.
JDBC Driver: supports the JDBC protocol.
Thrift Client: the Thrift client can be used to make calls to all Hive commands from programming languages such as PHP, Python, Java, C++, and Ruby.
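For illustration, a minimal JDBC client (hostname, port, database, and credentials are placeholders) that connects to HiveServer2 with the Hive JDBC driver mentioned above; the hive-jdbc jar must be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // register the Hive JDBC driver
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver.example.com:10000/default", "hiveuser", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}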
39) Which classes are used by Hive to read and write HDFS files?
Hive uses the following classes to perform read and write operations:
TextInputFormat/HiveIgnoreKeyTextOutputFormat: these classes read and write data in plain text file format.
SequenceFileInputFormat/SequenceFileOutputFormat: these classes read and write data in the Hadoop SequenceFile format.
40) What are the types of tables in Apache Hive?
There are two types of tables in Apache Hive
Managed tables.
External tables.
41) Is it possible to create multiple tables in Hive for the same data?
Yes.
42) What kind of Data Warehouse application is suitable for Hive?
Apache Hive is not a full database. The design limitations of Hadoop and HDFS impose limits on what Hive can perform. Apache Hive is well built for data warehouse applications, where
Relatively static data is analyzed,
Fast response times are not required, and
When the data is not changing rapidly.
Hive does not provide crucial properties needed for OLTP, Online Transaction Processing. Hive is well equipped for data warehouse applications, where a large data set is handled for insights, reports, etc.
43) What is the maximum size of string data type supported by Hive?
Maximum size is 2 GB.
MapReduce Interview Questions
44) What is MapReduce in Hadoop?
MapReduce is a framework for processing huge raw data sets using a large number of computers. It processes the raw data in two phases, i.e., the Map and Reduce phases. The MapReduce programming model scales easily to large data sets, and it is integrated with HDFS so that processing is distributed across the DataNodes of the cluster.
45) What is YARN?
Yet Another Resource Negotiator (YARN) is the next-generation MapReduce, also called MapReduce 2 or MRv2. It was introduced in the Hadoop 0.23 release to overcome the scalability issues in the classic MapReduce framework by splitting the functionality of the JobTracker into a ResourceManager and per-application ApplicationMasters.
46) What is data serialization?
Serialization is the process of converting object data into a byte stream for transmission over the network between nodes in a cluster, or for persistent data storage.
47) What is deserialization of data?
Deserialization is the inverse of serialization; it converts byte-stream data back into object data, for example when reading data from HDFS. Apache Hadoop provides the Writable interface for serialization and deserialization.
48) What are the key/value pairs in the MapReduce framework?
The MapReduce framework implements a data model in which data is represented as key/value pairs. Both the input and the output of the MapReduce framework must be key/value pairs.
49) What are the constraints on value and key classes in MapReduce?
Any data type used for a key or value field in a mapper or reducer must implement org.apache.hadoop.io.Writable so that the field can be serialized and deserialized. Keys must additionally be comparable with each other, so key types must implement Hadoop's org.apache.hadoop.io.WritableComparable interface, which in turn extends Hadoop's Writable interface. An illustrative custom key follows.
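An illustrative custom key type (not from the notes) that satisfies these constraints by implementing WritableComparable; in practice hashCode() and equals() would also be overridden so partitioning and grouping behave sensibly:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A composite key usable as a MapReduce key: serializable via write()/readFields(),
// and sortable via compareTo() as the shuffle requires for keys.
public class YearMonthKey implements WritableComparable<YearMonthKey> {
    private int year;
    private int month;

    public YearMonthKey() { }                 // no-arg constructor needed for deserialization

    public YearMonthKey(int year, int month) {
        this.year = year;
        this.month = month;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeInt(year);
        out.writeInt(month);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        year = in.readInt();
        month = in.readInt();
    }

    @Override
    public int compareTo(YearMonthKey other) {                // sort order used by the shuffle
        int byYear = Integer.compare(year, other.year);
        return byYear != 0 ? byYear : Integer.compare(month, other.month);
    }
}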
50) What are the main components of a MapReduce job?
The key components of a MapReduce job are the main driver class, the Mapper class, and the Reducer class.
51) What are the key configuration parameters that the user needs to specify to run a MapReduce job?
The user of the MapReduce framework needs to specify the following (a driver sketch follows this list):
The job's output location in the distributed file system.
The job's input location(s) in the distributed file system.
The input format.
The output format.
The class containing the map function.
The class containing the reduce function (optional).
The JAR file containing the mapper, reducer, and driver classes.
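Pulling the items above together, a hedged driver sketch (the paths come from the command line, and it reuses the word-count classes sketched under question 11):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);               // JAR containing these classes

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input location(s)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output location

        job.setInputFormatClass(TextInputFormat.class);         // input format
        job.setOutputFormatClass(TextOutputFormat.class);       // output format

        job.setMapperClass(WordCountMapper.class);              // class containing map()
        job.setReducerClass(WordCountReducer.class);            // class containing reduce() (optional)

        job.setOutputKeyClass(Text.class);                      // types of the final output pairs
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}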
52) What are the key components of the job flow in the YARN architecture?
The MapReduce job flow on YARN involves the components below.
A Client node, which submits the MapReduce job.
YARN Node Managers, which launch and monitor the tasks of jobs.
MapReduce Application Master, which coordinates the tasks running in the MapReduce job.
YARN Resource Manager, which allocates the cluster resources to jobs.
The HDFS file system, which is used for sharing job files between the above entities.
53) What is the importance of the Application Master in the YARN architecture?
It negotiates resources from the ResourceManager and works with the NodeManager(s) to run and monitor the tasks. The ApplicationMaster requests containers for all map and reduce tasks. As containers are assigned to tasks, it starts the containers by contacting the corresponding NodeManagers. It collects progress information from all the tasks, and these values are propagated to the user or client node.
54) What is the identity mapper in Apache Hadoop?
It is a default Mapper class provided by Apache Hadoop. The identity mapper does not process or manipulate the input data in any way; it simply writes the input straight to the output. The identity mapper's class name is org.apache.hadoop.mapred.lib.IdentityMapper.
55) What is the identity reducer in Apache Hadoop?
The identity reducer just passes the input key/value pairs straight through to the output. The identity reducer's class name is org.apache.hadoop.mapred.lib.IdentityReducer. When no reducer class is specified for a MapReduce job, this class is picked up automatically by the job.
56) What is chain Mapper?
It is a special implementation of the Mapper class through which a number of mapper classes can be executed in a chained fashion within a single map task. The ChainMapper class name is org.apache.hadoop.mapreduce.lib.chain.ChainMapper.
57) What is chain reducer?
It is the counterpart of ChainMapper: it allows a single reducer, followed by a number of mappers, to be executed within a single reduce task. The ChainReducer class name is org.apache.hadoop.mapreduce.lib.chain.ChainReducer.
58) How do you specify multiple mapper classes and the reducer class with ChainMapper and ChainReducer?
In ChainMapper, the ChainMapper.addMapper() method is used to add mapper classes to the chain. In ChainReducer, the ChainReducer.setReducer() method is used to set the single reducer class, and the
ChainReducer.addMapper() method is used to add mapper classes that run after the reducer.
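A hedged usage sketch with the new (mapreduce) API is shown below; FirstMapper, SecondMapper, MyReducer, and PostReduceMapper are invented class names, and the key/value types are chosen only for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

public class ChainJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "chain example");
        job.setJarByClass(ChainJobDriver.class);

        // Map phase: FirstMapper then SecondMapper run back to back in the same map task.
        ChainMapper.addMapper(job, FirstMapper.class,
                LongWritable.class, Text.class, Text.class, Text.class, new Configuration(false));
        ChainMapper.addMapper(job, SecondMapper.class,
                Text.class, Text.class, Text.class, Text.class, new Configuration(false));

        // Reduce phase: one reducer, optionally followed by more mappers in the same reduce task.
        ChainReducer.setReducer(job, MyReducer.class,
                Text.class, Text.class, Text.class, Text.class, new Configuration(false));
        ChainReducer.addMapper(job, PostReduceMapper.class,
                Text.class, Text.class, Text.class, Text.class, new Configuration(false));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}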
59) What is side data distribution in MapReduce framework?
Side data is the extra read-only data needed by a MapReduce job to process the main data set. In Hadoop there are two ways to make side data available to all the map or reduce tasks:
Distributed cache
Job Configuration
60) How can side data be distributed using the job configuration?
Side data can be distributed by setting arbitrary key/value pairs in the job configuration, using the various setter methods on the Configuration object. Within a task, the data can be retrieved from the Configuration returned by the Context's getConfiguration() method.
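A small self-contained sketch of both sides of this mechanism (the property key myapp.filter.country is invented for illustration):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataByConfiguration {

    // Driver side: set an arbitrary key/value pair on the job's Configuration.
    public static void configure(Job job) {
        job.getConfiguration().set("myapp.filter.country", "IN");
    }

    // Task side: read the value back through Context.getConfiguration() in setup().
    public static class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String country;

        @Override
        protected void setup(Context context) {
            country = context.getConfiguration().get("myapp.filter.country", "US");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().contains(country)) {
                context.write(new Text(country), value);
            }
        }
    }
}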
61) When should side data be distributed through the job configuration, and when should it not?
Distributing side data through the job configuration is useful only when the programmer needs to pass a small piece of metadata to the map or reduce tasks. This mechanism should not be used for more than a few kilobytes of data, because it puts pressure on memory usage, particularly in a system running hundreds of jobs.
62) What is Distributed Cache in MapReduce?
It is another way of distributing side data: files and archives are copied to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node only once per job.
63) How to provide files or archives to MapReduce job in distributed cache mechanism?
Files that need to be distributed can be specified as a comma-separated list of URIs as the argument to the -files option of the Hadoop job command; the files can be on HDFS.
Archive files (tar files, ZIP files, and gzipped tar files) can be copied to task nodes through the distributed cache by using the -archives option.
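Alternatively, files and archives can be added programmatically from the driver; the HDFS paths below are illustrative only:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache example");
        // Programmatic equivalent of the -files option: ship a small lookup file to every task node.
        job.addCacheFile(new URI("hdfs:///user/hadoop/lookup/countries.txt"));
        // Programmatic equivalent of the -archives option: ship and unpack an archive on task nodes.
        job.addCacheArchive(new URI("hdfs:///user/hadoop/lookup/geo-data.zip"));
        // ... set the mapper, reducer, input and output paths as usual, then submit the job.
    }
}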
64) Explain how distributed cache works in MapReduce Framework?
When a MapReduce job is submitted with distributed cache options, the node managers copy the files specified by the -archives, -files, and -libjars options from the distributed cache to a local disk. The local.cache.size property can be used to configure the cache size on the node managers' local disks. The data is localized under the ${hadoop.tmp.dir}/mapred/local directory.
65) What will Apache Hadoop do when a task has failed in a list of suppose 50 spawned tasks?
Apache Hadoop will restart the map or reduce task on another node manager; if the same task fails four times (the default maximum number of attempts), the whole job is marked as failed. The maximum number of attempts for map and reduce tasks can be set with the following properties in the mapred-site.xml file:
mapreduce.map.maxattempts
mapreduce.reduce.maxattempts
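These limits can also be overridden per job from the driver; the values below are arbitrary and shown only as a sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Raise the per-task retry limit for this job only (default is 4 attempts).
        conf.setInt("mapreduce.map.maxattempts", 6);
        conf.setInt("mapreduce.reduce.maxattempts", 6);
        Job job = Job.getInstance(conf, "retry example");
        // ... configure the mapper, reducer, and paths as usual, then submit the job.
    }
}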
Assume the HDFS block size is 256 MB and we have three files of sizes 248 KB, 268 MB, and 512 MB. How many input splits will the Hadoop framework create?
Hadoop will create 5 splits, as follows:
1 split for 248 KB file
2 splits for 268 MB file (1 of 256 MB and another of 12 MB)
2 splits for 512 MB file (2 Splits of 256 MB)
66) Why can’t we just have the file in HDFS and have the application read it instead of distributed cache?
The distributed cache copies the file to all node managers at the beginning of the job, so if a node manager runs 10 or 50 map or reduce tasks, they all use the same local copy.
If, instead, the file were read from HDFS within the job, every map or reduce task would access it from HDFS, so a node manager running 50 map tasks would read the file from HDFS 50 times. Reading the same data from the node manager's local file system is much faster than reading it from HDFS data nodes.
67) After restarting the namenode, MapReduce jobs that were working fine before the restart started to fail. What may be the reason for such failures?
The Hadoop cluster may still be in safe mode after the namenode restart. The administrator should wait for the namenode to exit safe mode (this can be checked with hdfs dfsadmin -safemode get) before restarting the jobs. Overlooking safe mode is one of the most common mistakes made by Hadoop administrators.
68) What are the things that you need to mention for a MapReduce job?
A. Classes for mapper and reducer.
B. Classes for reducer, mapper, and combiner.
C. Classes for the reducer, partitioner, mapper, and combiner.
D. None
Answer: A) The classes for the mapper and reducer.
69) How many times combiner will execute?
A. At least once.
B. 0 or 1 time.
C. 0, 1, or many times.
D. Can’t be configured
Answer: C) Zero, one, or many times.
70) Suppose you have a mapper that produces an integer value for each key, and the following set of reduce operations:
Reducer A: Give the maximum of the set of values.
Reducer B: Give the sum of the set of integer values.
Reducer C: Give the mean of the set of values.
Reducer D: Give the difference b/w the largest and smallest values in the set.
71) Which of the above mentioned reduce operations can be safely used as a combiner?
A. All of them.
B. A and B.
C. A, B, and D.
D. C and D.
E. None of them.
Answer: B) A and B. Maximum and sum are commutative and associative, so applying them partially to map output does not change the final result; the mean and the difference between the largest and smallest values do not have this property, so they cannot safely be used as combiners.
72) What is Uber task in YARN?
If a job is small, the application master may choose to run its tasks in the same JVM as itself, because it judges that the overhead of allocating new containers and running the tasks in them outweighs the gain from running them in parallel, compared with running them sequentially on one node. Such a job is said to be uberized, or run as an uber task.
73) How to configure Uber Tasks?
By default, a small job is one that has fewer than ten mappers, only one reducer, and an input size smaller than one HDFS block. These thresholds may be altered for a job by setting mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes. It is also possible to disable uber tasks entirely by setting mapreduce.job.ubertask.enable to false.
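A per-job sketch of these settings (the threshold values are chosen arbitrarily for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UberTaskConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.ubertask.enable", true);  // allow small jobs to run inside the AM's JVM
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);        // at most 9 mappers to qualify as "small"
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);     // at most 1 reducer to qualify as "small"
        Job job = Job.getInstance(conf, "uber example");
        // ... configure the mapper, reducer, and paths as usual, then submit the job.
    }
}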
74) What are the three ways to debug a failed MapReduce job?
Commonly there are two ways:
By using MapReduce job counters.
By using the YARN Web UI to check the syslogs for the actual status or error messages.
75) What is the significance of heartbeats in HDFS/MapReduce Framework?
A heartbeat in a master/slave architecture is a signal indicating that a node is alive. Datanodes send heartbeats to the Namenode, and node managers send heartbeats to the Resource Manager, to tell the master nodes that they are still active.
76) Can we rename the output file?
Yes. By default the output files are named part-r-nnnnn (or part-m-nnnnn for map-only jobs), but the names can be controlled, for example by using the MultipleOutputs class to write output files with custom names.
77) What are the default formats of input and output file in MapReduce jobs?
If they are not set explicitly, the input and output formats default to text: TextInputFormat for input and TextOutputFormat for output.
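Making the defaults explicit in the driver looks like this (a sketch; omitting the two set*FormatClass calls gives the same behaviour):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DefaultFormats {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "formats example");
        // These are the defaults: plain text input and output.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
    }
}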