Saturday, January 16, 2016

Word count program in Pig




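 -- Load each line as one chararray field (TextLoader does not split on tabs).
 inputdata = LOAD 'Input-Big.txt' USING TextLoader() AS (line:chararray);
 -- Split each line into words and emit one word per record.
 words = FOREACH inputdata GENERATE FLATTEN(TOKENIZE(line)) AS word;
 -- Keep only tokens made up entirely of word characters.
 filtered_words = FILTER words BY word MATCHES '\\w+';
 -- Group identical words together.
 word_groups = GROUP filtered_words BY word;
 -- Count the records in each group.
 word_count = FOREACH word_groups GENERATE group AS word, COUNT(filtered_words) AS count;
 -- Sort by count, highest first, and write the result.
 ordered_word_count = ORDER word_count BY count DESC;
 STORE ordered_word_count INTO 'PigWordCount';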


The above Pig script does the following:

  • The first statement loads the input file into the relation inputdata, one line per record.
  • The second statement splits each line into words using the TOKENIZE function, which produces a bag of words. The FLATTEN operator then un-nests the bag so that each word becomes its own record.
  • The third statement filters the words, keeping only tokens that consist entirely of word characters (letters, digits and underscores). Tokens that still contain punctuation, such as "end.", are dropped.
  • The fourth statement groups identical words together so that the count can be computed in the fifth statement.
  • The fifth statement counts the occurrences in each group.
  • The sixth statement sorts the result by count in descending order.
  • Finally, the sorted list is stored in an output folder named 'PigWordCount'.
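
To make the intermediate results easier to picture, the same steps can be traced in plain Python on a couple of sample lines. This is only a rough sketch and not part of the Pig script: the function name word_count and the sample input are made up for illustration, and the split pattern assumes Pig's default TOKENIZE delimiters (space, double quote, comma, parentheses and asterisk).

```python
import re
from collections import Counter


def word_count(lines):
    """Hypothetical helper that mirrors the Pig statements step by step."""
    # TOKENIZE + FLATTEN: split every line and emit one word per record.
    # The delimiter set is an assumption based on Pig's default TOKENIZE behaviour.
    words = [tok for line in lines for tok in re.split(r'[ ",()*]+', line) if tok]

    # FILTER ... MATCHES '\\w+': the whole token must consist of word characters.
    filtered_words = [w for w in words if re.fullmatch(r'\w+', w, flags=re.ASCII)]

    # GROUP ... BY word, then COUNT: number of occurrences per distinct word.
    counts = Counter(filtered_words)

    # ORDER ... BY count DESC: highest count first.
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sample = ["the cat, the dog", "the end."]
    for word, count in word_count(sample):
        print(word, count)
```

In this sample the token "end." still carries its full stop after tokenizing, so the filter drops it, just as the third Pig statement would.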





