Sunday, December 20, 2015

Populating XML Data into Hive External tables


Hi all,

Lets have a simple example for saving the Books XML data into Hive external tables. 

Sample file is attached at the end of the blog.

Books data is present in XML format as below.












1. Create External table to store the XML file data.











2. Create the external table to store the Books data. Columns are aligned to Books.xml tags, shown above.













Books.xml file is present at /usr/lib/hive path in LFS























1    To load this data into BooksData table, we have to make sure the data is formatted as per hive requirement.
a.      Each tag should be in same line.
b.      <?xml version="1.0"?> should not be present.
c.      As we require only books data, no need of catalog tags.
d.      Each <book></books> should be in single line.

To achieve this we will be using UNIX command as below.






In UNIX command, | is used for chaining/cascading the operation to next command.
e.g., output of cat Books.xml will be fed to sed ‘1d’. Output of sed ‘1d’ will be fed to sed ‘s|<catalog>||g’, so on.

What does each command in above UNIX command do?
  • cat Books.xml ==> Populates the file Books.xml
  • sed ‘1d’ ==> This command will delete the first line in Books.xml.
  • sed ‘s|<catalog>||g’ ==> This command will replace all the occurrences of               <catalog> tag with Empty string.
  • sed ‘s|</catalog>||g’ ==> This command will replace all the occurrences of </catalog> tag with Empty string.
  • tr ‘\n\r\t’ ‘ ‘ ==> This command will replace all the occurrence of \n, \t, \r with Empty string.
  • sed ‘s|</book>|</book>\n|g’ ==> This command will add the new line character after each </book> tag.
  • sed ‘s/^ *//;s/ *$//;s/  */ /;’ ==> This command will remove all the leading and trailing spaces in the file.
At last ‘> BooksFormatted.xml’ will save the final output as below.











Load the data from BooksFormatted.xml into XMLData table








Insert the formatted data from xmldata table into BooksData table.



















Check the data in BooksData table.

















XML Data saved into Hive external tables successfully.



Bash Script Execution for moving the XML data into Hive external tables

HiveShell.sh
 hive_path='/Hadoop/HiveExample'  
 sample_file=Books.xml  
 hive -e 'create database IF NOT EXISTS HiveExample;'  
 echo "1. Database created"  
 hive -e 'use HiveExample;'  
 hive -e "create external table IF NOT EXISTS HiveExample.BooksXML(XMLAsString String) row format delimited stored as textfile location '/Hadoop/HiveExample';"  
 echo "2. External table BooksXML created"  
 hive -e "create external table IF NOT EXISTS HiveExample.BooksDetails(BookId string, Author string, Title string, Genre string, price string, PublishedOn string, Description string) row format delimited stored as textfile location '/Hadoop/HiveExample';"  
 echo "3. External table BooksDetails created"  
 cat $sample_file | sed '1d' | sed 's|<catalog>||g' | sed 's|</catalog>||g' | tr '\n\r\t' ' ' | sed 's|</book>|</book>\n|g' | sed 's/^ *//;s/ *$//;s/ */ /;' > BooksHiveFormatted.xml  
 echo "4. Data from Books XML formatted as per Hive requirement"  
 hive -e "load data local inpath '/usr/lib/hive/BooksHiveFormatted.xml' overwrite into table HiveExample.BooksXML;"  
 echo "5. Data inserted from Formatted XML into BooksXML table"  
 hive -e "insert overwrite table HiveExample.BooksDetails select xpath_string(XMLAsString, 'book/@id'),xpath_string(XMLAsString, 'book/author'),xpath_string(XMLAsString, 'book/title'),xpath_string(XMLAsString, 'book/genre'),xpath_string(XMLAsString, 'book/price'),xpath_string(XMLAsString, 'book/publish_date'),xpath_string(XMLAsString, 'book/description') from HiveExample.BooksXML;"  
 echo "6. Inserted the data from BookXML table to BooksDetails table"  
 hive -e "select BookId, Author, price from HiveExample.BooksDetails;"  






Execution of the script

















































Books.xml is attached here.
Sample Books.xml

3 comments: