Oct 11, 2012

Hadoop Hive: How to keep your data safe

Usually you keep a lot of useful data in your Hadoop cluster, and you really don't want to lose it. Apache Hive is a very useful tool to access and query that data. Hive needs a schema definition to parse the data stored as files on HDFS, so in a nutshell you have to create a Hive table on top of your files. It is possible to drop this table accidentally, and by default all the data files will be deleted along with it. That can cause a great deal of trouble for your data warehouse. Unfortunately Hive doesn't have sufficient security to protect your data. On the other hand, it has some facilities that could (and should) be used.


 First of all, external tables should be used for all important and hard-to-restore data. The data files of such tables easily survive dropping of the table itself. The table can be restored even if it is partitioned: a very simple script can re-attach all partitions back to the table, as sketched below.
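
For example, here is a minimal sketch of such a restore script. It assumes a hypothetical partition column dt and that the partition directories (.../important_hive_table/dt=YYYY-MM-DD) are still in place under the table location; adjust the names to your own layout:

for dt in $(hadoop fs -ls /DATA/team_name/schema_name/important_hive_table | grep -o "dt=[^ /]*" | cut -d= -f2); do
  # re-register each partition directory found on HDFS with the metastore
  hive -e "use schema_name; alter table IMPORTANT_HIVE_TABLE add if not exists partition (dt='$dt');"
done

Depending on your Hive version, "msck repair table IMPORTANT_HIVE_TABLE" may do the same partition discovery in a single statement.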

The second rule is that the "no_drop" option should be enabled on the table. So the definition of a fully protected table will look like this:


create external table if not exists IMPORTANT_HIVE_TABLE (
  natural_key string
  -- more columns go here
)
row format delimited
stored as sequencefile
location '/DATA/team_name/schema_name/important_hive_table'
;

alter table IMPORTANT_HIVE_TABLE enable no_drop;
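
Note that if the table is partitioned, no_drop on the table by itself still allows individual partitions to be dropped. Depending on your Hive version, the cascade variant extends the protection to the partitions as well:

alter table IMPORTANT_HIVE_TABLE enable no_drop cascade;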
If you finally decide to get rid of this table and free all the space it uses in the cluster, then the following sequence of commands is required:
hive -e "use schema_name; alter table IMPORTANT_HIVE_TABLE disable no_drop; 
drop table IMPORTANT_HIVE_TABLE;"
hadoop fs -rmr '/DATA/team_name/schema_name/important_hive_table'
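
On newer Hadoop releases the -rmr shortcut is deprecated; assuming the same location, the long-form equivalent is:

hadoop fs -rm -r '/DATA/team_name/schema_name/important_hive_table'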
