inputdata = LOAD 'Input-Big.txt' AS (line:chararray);
words = FOREACH inputdata GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE group AS word, COUNT(filtered_words) AS count;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO 'PigWordCount';
The above Pig script:
- Loads the input file into the relation inputdata, reading each line as a chararray.
- Splits each line into words using the TOKENIZE function, which returns a bag of words. FLATTEN then un-nests the bag so that each word becomes its own tuple.
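- In the third statement, the words are filtered so that only tokens made up entirely of word characters (letters, digits, and underscores) remain; tokens containing punctuation or other characters are dropped.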
- In the fourth statement, the filtered words are grouped by word so that the count can be computed in the fifth statement.
- In the fifth statement, the occurrences of each word are counted.
- In the sixth statement, the result is sorted by count in descending order.
- Finally, the sorted list is stored in an output directory named 'PigWordCount'.
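
To make the data flow of the steps above easier to follow, here is a rough Python analogue of the same pipeline. It is only a sketch and not how Pig executes the script: the function name word_count and the sample lines are made up for illustration, and str.split() only approximates TOKENIZE, which also splits on a few other delimiters such as commas and parentheses.

```python
import re
from collections import Counter


def word_count(lines):
    """Hypothetical helper mirroring the Pig statements on an in-memory list of lines."""
    # TOKENIZE + FLATTEN: split every line into tokens, one token per row.
    # Assumption: whitespace splitting stands in for TOKENIZE's delimiter set.
    words = [token for line in lines for token in line.split()]

    # FILTER ... MATCHES '\\w+': keep tokens that consist entirely of word
    # characters. MATCHES tests the whole string, hence fullmatch; re.ASCII
    # approximates Java's default ASCII-only \w.
    filtered_words = [w for w in words if re.fullmatch(r"\w+", w, flags=re.ASCII)]

    # GROUP ... BY word, then COUNT: tally occurrences of each distinct word.
    counts = Counter(filtered_words)

    # ORDER ... BY count DESC: highest counts first.
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


# Made-up sample input standing in for the lines of 'Input-Big.txt'.
sample = ["the quick brown fox", "the lazy dog -- the end"]
result = word_count(sample)

for word, count in result:
    print(word, count)
```

With this sample, "the" comes first with a count of 3, and the token "--" is dropped by the filter because it contains no word characters. The relative order of words with equal counts is not something the Pig script specifies, so it should not be relied on.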