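-- Load every line of the input file as a single chararray field
inputdata = LOAD 'Input-Big.txt' AS (line:chararray);
-- Split each line into tokens and emit one record per token
words = FOREACH inputdata GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Keep only tokens made up entirely of word characters
filtered_words = FILTER words BY word MATCHES '\\w+';
-- Group identical words, then count the records in each group
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE group AS word, COUNT(filtered_words) AS count;
-- Sort by frequency, most common first, and write the result out
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO 'PigWordCount';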
The above Pig script does the following:
- The first statement loads the input file into the relation inputdata, one line per record.
- The second statement splits each line into words using the TOKENIZE function, which creates a bag of words. The FLATTEN operator then un-nests that bag, so each word becomes a record of its own.
- The third statement filters the words, keeping only tokens that consist entirely of word characters (letters, digits and underscore). This drops empty tokens and tokens that contain punctuation.
- The fourth statement groups the filtered words so that identical words end up in the same group. The count itself is computed in the fifth statement.
- The fifth statement counts the number of occurrences in each group.
- The sixth statement sorts the result by count in descending order.
- Finally, the sorted list is saved into an output folder named 'PigWordCount'.
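To make the data flow easier to follow, here is a rough Python sketch of the same steps on a tiny in-memory sample. It is only an illustration, not how Pig executes the script. The function name word_count and the sample lines are made up for this example, and the separator set assumes TOKENIZE's default delimiters (space, double quote, comma, parentheses and asterisk).

```python
import re
from collections import Counter

# Assumed default TOKENIZE separators: space, double quote, comma,
# parentheses and asterisk.
SEPARATORS = re.compile(r'[ ",()*]')

# Pig's MATCHES must match the whole string; re.ASCII mirrors the
# ASCII-only meaning of \w in Java regular expressions.
WORD = re.compile(r'\w+', re.ASCII)


def word_count(lines):
    """Hypothetical helper mirroring the Pig script step by step."""
    # TOKENIZE + FLATTEN: one entry per token across all lines
    words = [tok for line in lines for tok in SEPARATORS.split(line)]
    # FILTER ... MATCHES '\\w+': keep tokens that are entirely word characters
    filtered_words = [w for w in words if WORD.fullmatch(w)]
    # GROUP ... BY word, then COUNT per group
    counts = Counter(filtered_words)
    # ORDER ... BY count DESC (the order of ties is not defined in Pig)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sample = ["the quick fox, the lazy dog", "the end."]
    for word, count in word_count(sample):
        print(word, count)
```

Note that the token "end." is dropped by the filter because of the trailing period, which is also what the MATCHES '\\w+' condition does in the Pig script.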