Monday 16 June 2014

Brandscreen Technology: Hive

Hive is a technology that sits on top of Hadoop which allows the searching of text files using SQL  statements.

org.apache.hadoop.hive.contrib.serde2.RegexSerDe in hive-contrib may be used to process irregular text files (e.g. webserver logs).

Hive can directly use CSV files and can directly use s3, which is particularly useful when staying inside the AWS environment.

Key steps to running a query are:

  1. Files in s3 should be laid out as <root>/<top key>/<partition key1=value1>/<partition key2=value2>...
    1. e.g. s3://test/logs/dt=2014-06-16/hr=15
  2. Startup up EC2 for Amazon, login using SSH to hadoop@<master node name> using the pem key you set up for Amazon
  3. 'Add' any external modules (such as hive-contrib) if required
  4. CREATE EXTERNAL TABLE with, specifying:
    1. PARTITIONED BY with partitioning matching partition 1... in step 1
      1. e.g. PARTITIONED BY (dt string, hr int)
    2. ROW FORMAT as:
      1. DELIMITED FIELDS
      2. SERDE (with regex)
  5. ALTER TABLE xxx ADD IF NOT EXISTS PARTITION (key1=partition1...);
    1. e.g. ALTER TABLE abc ADD IF NOT EXISTS PARTITION (dt=2014-06-16, hr=15);
    2. A separate ALTER TABLE statement must be provided for each partition required
  6. Query all the things! Just the same as a normal SQL query
The advantages of Hive are that you don't need to import your log files into a database in order to query them. The disadvantage of Hive is that although it has good throughput it has very poor latency. Some tools such as Shark may be able to mitigate this problem.

No comments:

Post a Comment