Big Data - Hadoop: January 2016

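 -- Load the input file; each record has one chararray field named line
 inputdata = LOAD 'Input-Big.txt' AS (line:chararray);
 -- Split each line into a bag of words, then un-nest the bag into one record per word
 words = FOREACH inputdata GENERATE FLATTEN(TOKENIZE(line)) AS word;
 -- Keep only tokens made up entirely of word characters
 filtered_words = FILTER words BY word MATCHES '\\w+';
 -- Group identical words and count the occurrences in each group
 word_groups = GROUP filtered_words BY word;
 word_count = FOREACH word_groups GENERATE group AS word, COUNT(filtered_words) AS count;
 -- Sort by count, highest first, and write the result out
 ordered_word_count = ORDER word_count BY count DESC;
 STORE ordered_word_count INTO 'PigWordCount';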

The above Pig script does the following:

In the first statement, the input file is loaded into the relation inputdata, with one chararray field named line.
In the second statement, each line is split into words using the TOKENIZE function, which returns a bag of words. The FLATTEN operator then un-nests the bag so that each word becomes its own record.
In the third statement, the words are filtered so that only tokens made up entirely of word characters (letters, digits, and underscores) are kept; tokens containing punctuation or other symbols are dropped.
In the fourth statement, the filtered words are grouped by word so that the count can be computed in the fifth statement.
In the fifth statement, the number of occurrences of each word is counted.
In the sixth statement, the result is sorted by count in descending order.
Finally, the sorted list is saved to an output folder named 'PigWordCount'.
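
To make the data flow easier to follow, the same pipeline can be sketched in plain Python on a tiny in-memory input. This is only an illustration, not how Pig executes the script: the function name word_count and the sample lines are made up for this example, and the tokenizing step is simplified to split on whitespace only, whereas Pig's TOKENIZE also splits on a few other delimiters such as commas and parentheses.

```python
import re
from collections import Counter


def word_count(lines):
    """Rough Python analogue of the Pig word count script (illustrative only)."""
    # TOKENIZE + FLATTEN: one record per token.
    # Assumption: whitespace-only splitting, simpler than Pig's TOKENIZE.
    words = [token for line in lines for token in line.split()]

    # FILTER ... MATCHES '\\w+': the whole token must consist of word characters.
    filtered_words = [w for w in words if re.fullmatch(r"\w+", w, flags=re.ASCII)]

    # GROUP ... BY word, then COUNT per group.
    counts = Counter(filtered_words)

    # ORDER ... BY count DESC.
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical sample input standing in for 'Input-Big.txt'.
sample = ["the cat saw the dog", "the dog ran away!"]
result = word_count(sample)
print(result[:2])  # [('the', 3), ('dog', 2)]
```

Note that the token "away!" is dropped by the filter because it contains a punctuation character, which is the same effect the MATCHES '\\w+' condition has in the Pig script.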

Saturday, January 16, 2016

Word count program in Pig