Sunday, July 30, 2023

Hive Commands



Apache Hive has two kinds of table
1. Managed tables
2. External tables

1. Manage Tables - These are the tables which is getting populated from the files in local file system.

2. External Tables - These are the tables which is getting populated from files in HDFS path.

After inserting the data into Managed tables or External tables, the data will be part of HDFS only.



Below is the list of commands used in Hive for different purposes.

Create table Commands

Managed Table Creation

CREATE table <Table Name> <Schema definitions> 
row format delimited
fields terminated by <delimiter>
lines terminated by <delimiter>
stored as <File format>;

Example:


Data file in LFS has lines as below:

1, Pranav, 12000
2, Ashish, 13000
3, Nilesh, 14000
4, Ram, 15000
5, Karan, 16000

As per the data, create table statement would be as below:


 Create table ManagedTable(id int, name String, salary int)  
 row format delimited  
 fields terminated by ','  
 lines terminated by '\n'  -- not a mandatory statement
 stored as textfile;  


External Table Creation

CREATE external table <Table Name> <Schema definitions> 
row format delimited
fields terminated by <delimiter>
lines terminated by <delimiter>
stored as <File format> location <HDFS Path>;

Note: The difference between the syntax of External table and Managed table is of 2 keywords namely external and location


Load data into Hive tables







WordCount Program to show Top 10 searched words in PySpark

from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")
sc.setLogLevel("ERROR")
input = sc.textFile("C:/Users/pranavwagde/Downloads/DE/BigData/search_data.txt")

# This will divide the file's data into multiple words, one below another
words = input.flatMap(lambda x: x.split(" "))

# convert each word to K,V format to count effectively
wordCounts = words.map(lambda x:(x,1))

# count the occurrences of each word
finalCount = wordCounts.reduceByKey(lambda x,y:(x+y))

# Sort the output by the no of occurrences, to get the words with the highest no of appearances
sortedOrder = finalCount.sortBy(lambda x: x[1],False)

# Take the top 10 searched words
top10MostSearchedWords = sortedOrder.take(10)

# Print the list
for word in top10MostSearchedWords:
print(word)



Courtesy: Frank Kane Udemy Course