Friday, November 27, 2015

Save Actions in Eclipse editor



While developing code in the Eclipse editor, we have to remove unused imports and unused variables manually if they are not used in our programs.
This is sometimes a very cumbersome process and takes a lot of time.

Eclipse provides a solution for this. Using Save Actions, we can set an option so that, while saving the program, Eclipse automatically removes the unused import statements/variables for us and formats the code accordingly.

In the Eclipse editor:
  1. Go to the menu Window | Preferences.
  2. On the left side, expand Java, then click Editor.
  3. Select Save Actions.

Click the Configure... button.

Select the appropriate options on each of the tabs as per the project requirements.

After selecting the appropriate options, click OK.

Now, whenever the user saves the file, any unused local variables or unused imports will get removed automatically.
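
As a minimal illustration (the class and identifiers below are hypothetical), a file like the following, saved with the "Remove unused imports" and "Remove unused local variables" actions ticked, would have the marked lines cleaned up automatically on save:

 import java.util.ArrayList;
 import java.util.List; // unused import - removed automatically on save

 public class SaveActionsDemo {
      public void addItem(ArrayList<String> items, String item) {
           int unusedCounter = 0; // unused local variable - removed on save if the option is ticked
           items.add(item);
      }
 }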

Hadoop Map-Reduce - WordCount example explained in detail



Just as in other programming languages (C, C++, Java, etc.) we learn a basic program called "Hello World", on the same lines, in Hadoop there is a basic program named "Word Count", which uses both the Map and Reduce concepts.

Important Note: In the Hadoop context, whenever we execute

hadoop jar <JARFileName> <DriverClassNameWithoutExtension> <HDFSInputPath> <HDFSOutputPath>

we have to make sure that the HDFS output path does not already exist.
To avoid this situation, this is handled as a part of the driver code: if the output folder is already present in HDFS, it is deleted first and then the processing starts.
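
For reference, the corresponding check is the following excerpt from the driver's main() method (the complete code appears later in this post):

 /* Creating Filesystem object with the configuration */
 FileSystem fs = FileSystem.get(conf);
 /* Check if the output path (args[1]) exists or not */
 if (fs.exists(new Path(args[1]))) {
      /* If it exists, delete the output path */
      fs.delete(new Path(args[1]), true);
 }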

The program is explained in detail below:

Requirement:
Inputfile.txt contains the following data:

Inputfile.txt
------------------------------------------------------------------------------------------------
Hadoop is a big data analytics tool.
Hadoop has different components like MapReduce, Pig, hive, hbase, sqoop etc.
MapReduce is used for processing the data using Java.
-------------------------------------------------------------------------------------------------

We have to count the number of occurrences of each word in the file.
If the same word appears in a different case (uppercase/lowercase), it is considered a different word.
E.g., Hadoop and hadoop are considered different words as per the logic implemented.

So the output should be as below:
Hadoop  2
is  2
a  1
big  1
data  2
analytics  1
tool.  1
has  1
different  1
components  1
like  1
MapReduce,  1
Pig,  1
hive,  1
hbase,  1
sqoop  1
etc.  1
MapReduce  1
used  1
for  1
processing  1
the  1
using  1
Java.  1


Logic being used in Map-Reduce

There may be different ways to count the number of occurrences of the words in the text file, but MapReduce specifically uses the logic below.

Input
Hadoop is a big data analytics tool.
Hadoop has different components like MapReduce, Pig, hive, hbase, sqoop etc.
MapReduce is used for processing the data using Java.

Input to Mapper (the key is the byte offset of the line within the file; the offset values shown here are illustrative)
1010L    Hadoop is a big data analytics tool.
1012L    Hadoop has different components like MapReduce, Pig, hive, hbase, sqoop etc.
1013L    MapReduce is used for processing the data using Java.

After the Mapper phase
Each word is fetched and assigned the value 1. It indicates that the Mapper found that word once.

Hadoop  1
is  1
a  1
big  1
data  1
analytics  1
tool. 1
Hadoop  1
has  1
different  1
components  1
like  1
MapReduce,  1
Pig,  1
hive,  1
hbase,  1
sqoop  1
etc. 1
MapReduce  1
is  1
used  1
for  1
processing  1
the  1
data  1
using  1
Java. 1

After Reducer Phase

The Reducer collects the identical words from the Mapper output and sums up their counts.
Hadoop 2
is 2
a 1
big 1
data 2
analytics 1
tool. 1
has 1
different 1
components 1
like 1
MapReduce, 1
Pig, 1
hive, 1
hbase, 1
sqoop 1
etc. 1
MapReduce 1
used 1
for 1
processing 1
the 1
using 1
Java. 1
This is the final output, having the count of each word in the input file.

The program below contains each and every detail, so that even a newbie in Java can write the WordCount program.

Note: I am using Java 1.6 and Eclipse Helios for this development.
          For Hadoop, Java 1.6 is the minimum requirement. We can use any flavor of Eclipse for Hadoop development.

  1. Open Eclipse for creating the Java project.
  2. Go to File -> New -> Java Project.
  3. Enter the name in Project Name. I used WordCountDemoApp as the name for my Word Count Java project.
  4. Click Finish.
  5. Go to Project Explorer. You will notice that WordCountDemoApp is created as the project, with subfolders named src and JRE System Library [jre6].
  6. Right click on the src folder and select New -> Class.
  7. Enter the Package name as MapReduceWordCount (optional), the Name of the main class as WordCount (mandatory), and tick the checkbox for "public static void main(String[] args)".

Copy the code below into Eclipse.
Detailed comments have been added in the code for each line.
 package MapReduceWordCount;  
 import java.io.IOException;  
 import java.util.StringTokenizer;  
 import org.apache.hadoop.conf.Configuration;  
 import org.apache.hadoop.fs.FileSystem;  
 import org.apache.hadoop.fs.Path;  
 import org.apache.hadoop.io.IntWritable;  
 import org.apache.hadoop.io.Text;  
 import org.apache.hadoop.mapreduce.Job;  
 import org.apache.hadoop.mapreduce.Mapper;  
 import org.apache.hadoop.mapreduce.Reducer;  
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
 /*Driver class*/  
 public class WordCount {  
      // Class for Mapper, to convert the string into tokens  
      /*  
       * e.g., I/P - Hadoop is big data tool, which is good for data analytics  
       */  
      /*  
       * O/P - Hadoop - 1 is - 1 big - 1 data - 1 tool - 1 which - 1 is - 1 good -  
       * 1 for - 1 data - 1 analytics - 1  
       */  
      public static class TokenizerMapper extends  
                Mapper<Object, Text, Text, IntWritable> {  
           // Constant variable for 1  
           private final static IntWritable ONE = new IntWritable(1);  
           // Declaring word variable of type Text  
           private final Text word = new Text();  
           // Overridden function from Mapper class  
           @Override  
           public void map(Object key, Text value, Context context)  
                     throws IOException, InterruptedException {  
                 // Splitting the current line (value) into tokens (words)  
                 // separated by SPACE  
                StringTokenizer str = new StringTokenizer(value.toString(), " ");  
                 // Running the below loop till there are no more tokens in the current line  
                while (str.hasMoreTokens()) {  
                     // Set the word with the next available token  
                     word.set(str.nextToken());  
                     //  
                      // Use the below line of code instead of the above, if you want to  
                      // treat hadoop and Hadoop as the same word: convert all words to  
                      // lowercase before writing them to the context.  
                     // word.set(str.nextToken().toLowerCase());  
                     // Write the K, V into context  
                     context.write(word, ONE);  
                }  
           }  
      }  
      // Class for Reducer, to count the words from each Mapper's output  
      /*  
       * e.g., I/P -> Hadoop - 1 is - 1 big - 1 data - 1 tool - 1 which - 1 is - 1  
       * good - 1 for - 1 data - 1 analytics - 1  
       */  
      /*  
       * O/P -> Hadoop - 1 is - 2 big - 1 data - 2 tool - 1 which - 1 good - 1  
       * for - 1 analytics - 1  
       */  
      public static class IntSumReducer extends  
                Reducer<Text, IntWritable, Text, IntWritable> {  
           // variable to store the result  
           IntWritable result = new IntWritable();  
           // Press Ctrl+3 and type 'override' to get the overridden methods  
           // Overridden function from Reducer class  
           @Override  
           public void reduce(Text key, Iterable<IntWritable> values,  
                     Context context) throws IOException, InterruptedException {  
                // initializing the sum to 0  
                int sum = 0;  
                /*  
                 * Using foreach loop iterating through all the Mapper output to  
                 * count the words  
                 */  
                for (IntWritable value : values) {  
                     sum = sum + value.get();  
                }  
                /* Set the result with sum value */  
                result.set(sum);  
                /* Write the K,V into context */  
                context.write(key, result);  
           }  
      }  
      public static void main(String[] args) throws Exception {  
           // Configuration details w.r.t. the Job and JAR file  
           /* Provides access to configuration parameters */  
           Configuration conf = new Configuration();  
           /* Creating the job object for the Hadoop processing */  
           Job job = new Job(conf, "WORDCOUNTJOB");  
           /* Creating Filesystem object with the configuration */  
           FileSystem fs = FileSystem.get(conf);  
            /* Check if the output path (args[1]) exists or not */  
            if (fs.exists(new Path(args[1]))) {  
                 /* If it exists, delete the output path */  
                 fs.delete(new Path(args[1]), true);  
            }  
           // Setting Driver class  
           job.setJarByClass(WordCount.class);  
           // Setting the Mapper class  
           job.setMapperClass(TokenizerMapper.class);  
           // Setting the Combiner class  
           job.setCombinerClass(IntSumReducer.class);  
           // Setting the Reducer class  
           job.setReducerClass(IntSumReducer.class);  
           // Setting the Output Key class  
           job.setOutputKeyClass(Text.class);  
           // Setting the Output value class  
           job.setOutputValueClass(IntWritable.class);  
           // Adding the Input path  
           FileInputFormat.addInputPath(job, new Path(args[0]));  
           // Setting the output path  
           FileOutputFormat.setOutputPath(job, new Path(args[1]));  
           // System exit strategy  
           System.exit(job.waitForCompletion(true) ? 0 : 1);  
      }  
 }  


Adding Dependent JAR files for Hadoop

Below are the 2 JAR files which are required for this development:
  • hadoop-ant-0.20.2-cdh3u5.jar
  • hadoop-core-0.20.2-cdh3u5.jar

Process to add dependent JARs in eclipse

  1. Right click on the project WordCountDemoApp.
  2. Go to Build Path | Configure Build Path...
  3. Click Add External JARs...
  4. Browse the folder, select the required JARs, and click OK.


How to create the JAR file?
  1. Right click on the project (WordCountDemoApp in this case).
  2. Click Export.
  3. Select JAR file option from the list. Click Next.
  4. Select the Project correctly and specify the path on which JAR file should be created.
  5. Click Next.
  6. Click Next.
  7. Keep the Main class field empty; do not specify anything in it (the driver class name is passed on the command line instead).
  8. Click Finish.
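
Alternatively, the JAR can also be created from the command line; a minimal sketch, assuming the compiled .class files are under the project's bin folder and using wordcount.jar as a hypothetical name:

#jar -cvf wordcount.jar -C bin .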


Execution steps
  1. Copy the JAR file to the appropriate location on the local file system (LFS) of the Hadoop machine.
  2. Go to that location/path and type a command as shown in the example below.
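
For example, one command that lists the classes packaged in the JAR (the name wordcount.jar is illustrative) is:

#jar -tf wordcount.jar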

The above command gives the details of the classes, i.e., the Mapper class (TokenizerMapper), the Reducer class (IntSumReducer) and the Driver class (WordCount).
Finally, at the same path, type the command below.



The command format is:
#hadoop jar <RunnableJARName> <DriverClassNameWithoutExtension> <HDFSInputPathWithFileName(s)> <HDFSOutputPath>
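
An illustrative invocation for this example (the JAR name and HDFS paths are hypothetical; note that the Driver class is qualified with its package name):

#hadoop jar wordcount.jar MapReduceWordCount.WordCount /user/hadoop/Inputfile.txt /user/hadoop/wordcount_output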


Input file:

Final Output




Sunday, November 22, 2015

Components in Hadoop Architecture 1.x


The Hadoop system has a Master-Slave architecture. The Hadoop 1.x file system has a default block size of 64 MB.
The default replication factor in Hadoop is 3, which is configurable.
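
For reference, both values can be overridden in hdfs-site.xml; a minimal sketch, where the values shown are the Hadoop 1.x defaults:

 <!-- hdfs-site.xml -->
 <property>
   <name>dfs.block.size</name>
   <value>67108864</value> <!-- 64 MB, specified in bytes -->
 </property>
 <property>
   <name>dfs.replication</name>
   <value>3</value> <!-- default replication factor -->
 </property>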

There are a total of 5 components in the Hadoop 1.x architecture:

1. Name Node (NN)
2. Data Node (DN)
3. Secondary Name Node (SNN)
4. Job Tracker (JT)
5. Task Tracker (TT)

The Name Node, Data Node and Secondary Name Node are called the storage components in Hadoop, whereas the Job Tracker and Task Tracker are called the processing components in Hadoop.

The Name Node is the master node in Hadoop.
The Data Nodes are the slave nodes in Hadoop.
The master node's MR component is called the Job Tracker.
The slave node's MR component is called the Task Tracker.

Details of each of the components are as below:

Name Node:

1. The job of the Name Node is to decide how and where the blocks of each and every file are physically stored in the cluster.
2. It also manages the metadata (data about the data) of the files stored in the Data Nodes.
3. The Name Node also decides which physical locations in the Data Nodes must be combined to reconstruct the actual file.
4. The Name Node stores the metadata in the FSImage and EditLog files at regular intervals.
This process is called the checkpoint mechanism.

Note: The Name Node is meant for maintaining the metadata information of the complete Hadoop cluster, but it never stores the actual data. The actual data is always stored in the Data Nodes only.
The Name Node is also called the Single Point of Failure (SPOF).

Data Node:
The Data Node stores the file data in blocks, as per the instructions from the Name Node.
The Data Node sends an RPC signal at regular intervals to notify the Name Node that it is alive and working (called the heartbeat mechanism).

Secondary Name Node:
The Secondary Name Node acts as a backup for the Name Node.
It stores the metadata information from the Name Node at regular intervals as per the checkpoint mechanism.
When the Name Node goes down, the Secondary Name Node comes into the picture and acts as a temporary Name Node until the Name Node becomes active again.

The working of the Secondary Name Node is described in detail in later blogs.

Job Tracker
1. The Job Tracker most of the time resides on the same node as the Name Node.
2. Its job is to assign tasks to the Data Nodes/Task Trackers.
3. It also decides the job scheduling for the Data Nodes/Task Trackers.
4. In case of a job failure, the Job Tracker decides about rescheduling the task on some other node.

Task Tracker
1. The Task Tracker's job is to execute the tasks assigned by the Job Tracker.

Note: The communication between Job Tracker and Task Tracker is via MR jobs only.


How do the components in Hadoop work collaboratively?
When a client requests data from the Hadoop system:
  • When the Hadoop system receives the client request, it is first received by the master node.
  • The master node's MR component, the "Job Tracker", is responsible for receiving the client work; it divides the work into manageable independent tasks and assigns them to the Task Trackers.
  • The slave node's MR component, the "Task Tracker", receives the tasks from the "Job Tracker" and performs the work using MR.
  • Once all the Task Trackers have finished their work, the JT takes those results and combines them to produce the final result.
  • At last, the Hadoop system sends the results back to the client.