Hi all,
Lets have a simple example for saving the Books XML data into Hive external tables.
Sample file is attached at the end of the blog.
Books data is present in XML format as below.
1. Create External table to store the XML file data.
2. Create the external table to store the Books data. Columns are aligned to Books.xml tags, shown above.
Books.xml file is present at /usr/lib/hive path in LFS
1 To load this data into BooksData table, we have
to make sure the data is formatted as per hive requirement.
a.
Each tag should be in same line.
b.
<?xml version="1.0"?> should not
be present.
c.
As we require only books data, no need of catalog tags.
d.
Each <book></books> should be in
single line.
To
achieve this we will be using UNIX command as below.
In UNIX command, | is used for
chaining/cascading the operation to next command.
e.g., output of cat Books.xml will be fed
to sed ‘1d’. Output of sed ‘1d’ will be fed to sed ‘s|<catalog>||g’, so
on.
What does each command in above UNIX command do?
What does each command in above UNIX command do?
- cat Books.xml ==> Populates the file Books.xml
- sed ‘1d’ ==> This command will delete the first line in Books.xml.
- sed ‘s|<catalog>||g’ ==> This command will replace all the occurrences of <catalog> tag with Empty string.
- sed ‘s|</catalog>||g’ ==> This command will replace all the occurrences of </catalog> tag with Empty string.
- tr ‘\n\r\t’ ‘ ‘ ==> This command will replace all the occurrence of \n, \t, \r with Empty string.
- sed ‘s|</book>|</book>\n|g’ ==> This command will add the new line character after each </book> tag.
- sed ‘s/^ *//;s/ *$//;s/ */ /;’ ==> This command will remove all the leading and trailing spaces in the file.
At
last ‘> BooksFormatted.xml’ will save the final output as below.
Load the data from BooksFormatted.xml into XMLData table
Insert the formatted data from xmldata table into BooksData table.
Check the data in BooksData table.
XML Data saved into Hive external tables successfully.
Bash Script Execution for moving the XML data into Hive external tables
HiveShell.sh
Execution of the script
Books.xml is attached here.
Sample Books.xml
Load the data from BooksFormatted.xml into XMLData table
Insert the formatted data from xmldata table into BooksData table.
Check the data in BooksData table.
XML Data saved into Hive external tables successfully.
Bash Script Execution for moving the XML data into Hive external tables
HiveShell.sh
hive_path='/Hadoop/HiveExample'
sample_file=Books.xml
hive -e 'create database IF NOT EXISTS HiveExample;'
echo "1. Database created"
hive -e 'use HiveExample;'
hive -e "create external table IF NOT EXISTS HiveExample.BooksXML(XMLAsString String) row format delimited stored as textfile location '/Hadoop/HiveExample';"
echo "2. External table BooksXML created"
hive -e "create external table IF NOT EXISTS HiveExample.BooksDetails(BookId string, Author string, Title string, Genre string, price string, PublishedOn string, Description string) row format delimited stored as textfile location '/Hadoop/HiveExample';"
echo "3. External table BooksDetails created"
cat $sample_file | sed '1d' | sed 's|<catalog>||g' | sed 's|</catalog>||g' | tr '\n\r\t' ' ' | sed 's|</book>|</book>\n|g' | sed 's/^ *//;s/ *$//;s/ */ /;' > BooksHiveFormatted.xml
echo "4. Data from Books XML formatted as per Hive requirement"
hive -e "load data local inpath '/usr/lib/hive/BooksHiveFormatted.xml' overwrite into table HiveExample.BooksXML;"
echo "5. Data inserted from Formatted XML into BooksXML table"
hive -e "insert overwrite table HiveExample.BooksDetails select xpath_string(XMLAsString, 'book/@id'),xpath_string(XMLAsString, 'book/author'),xpath_string(XMLAsString, 'book/title'),xpath_string(XMLAsString, 'book/genre'),xpath_string(XMLAsString, 'book/price'),xpath_string(XMLAsString, 'book/publish_date'),xpath_string(XMLAsString, 'book/description') from HiveExample.BooksXML;"
echo "6. Inserted the data from BookXML table to BooksDetails table"
hive -e "select BookId, Author, price from HiveExample.BooksDetails;"
Books.xml is attached here.
Sample Books.xml
Nice post ! Thanks for sharing valuable information with us. Keep sharing..Big data hadoop online Course Hyderabad
ReplyDeleteYour new valuable key points imply much a person like me and extremely more to my office workers. With thanks; from every one of us.
ReplyDeleteBest PHP Training Institute in Chennai|PHP Course in chennai
Best .Net Training Institute in Chennai
Software Testing Training in Chennai
Blue Prism Training in Chennai
Angularjs Training in Chennai
Great article,keep sharing more posts with us.
ReplyDeletethank you..
hadoop admin course
big data and hadoop course