Saturday, January 16, 2016

Word count program in Pig




 inputdata = LOAD 'Input-Big.txt' AS (line:chararray);  
 words = FOREACH inputdata GENERATE FLATTEN(TOKENIZE(line)) AS word;  
 filtered_words = FILTER words BY word MATCHES '\\w+';  
 word_groups = GROUP filtered_words BY word;  
 word_count = FOREACH word_groups GENERATE group AS word , COUNT(filtered_words) AS count;  
 ordered_word_count = ORDER word_count BY count DESC;  
 STORE ordered_word_count INTO 'PigWordCount';  


The above Pig script does the following:

  • The first statement loads the input file into the relation inputdata, reading each line as a chararray field named line.
  • The second statement splits each line into words using the TOKENIZE function, which returns a bag of words. FLATTEN then un-nests the bag so that each word becomes a separate record.
  • The third statement keeps only the tokens that consist entirely of word characters (letters, digits, and underscores). Tokens that contain punctuation, such as 'end.', are dropped.
  • The fourth statement groups the filtered words so that identical words fall into the same group. This prepares the count, which is computed in the fifth statement.
  • The fifth statement counts the occurrences in each group.
  • The sixth statement sorts the result by count in descending order.
  • Finally, the sorted list is stored in an output directory named 'PigWordCount'.
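
To make the data flow concrete, the steps above can be traced in plain Python. This is only a rough analogue and not Pig itself: the function name word_count, the sample lines, and the delimiter set (an approximation of TOKENIZE's default delimiters: space, double quote, comma, parentheses, and asterisk) are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed approximation of TOKENIZE's default delimiters:
# space, double quote, comma, parentheses, asterisk.
DELIMS = re.compile(r'[ ",()*]+')
# Mirrors MATCHES '\\w+': the whole token must be ASCII letters, digits, or underscores.
WORD = re.compile(r'\w+', re.ASCII)

def word_count(lines):
    """Hypothetical helper mirroring statements 2 to 6 of the Pig script."""
    # Statement 2: TOKENIZE each line, then FLATTEN into one record per token.
    words = [tok for line in lines for tok in DELIMS.split(line) if tok]
    # Statement 3: keep only tokens that match the pattern in full.
    filtered = [w for w in words if WORD.fullmatch(w)]
    # Statements 4 and 5: group identical words and count each group.
    counts = Counter(filtered)
    # Statement 6: order by count, descending.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Made-up sample input; 'end.' contains a period, so the filter drops it.
sample = ["the cat, the dog", "the end."]
result = word_count(sample)
print(result)
```

Note that the filter does more than remove spaces: a token such as 'end.' is discarded entirely rather than counted as 'end'. If such words should be counted, the punctuation would have to be stripped before filtering.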