Skip to main content
Skip table of contents

Synchronize data from Hive datasource

Overview

When the INDEXIMA table is created with the following SQL query:

SQL
CREATE TABLE my_dataspace FROM my_hive_source IN 'jdbc:hive2://my_hive_server:10000' (INDEX (my_index));

INDEXIMA reads the table structure and creates a dataspace with the same or equivalent data types.

The following SQL command is used to detect differences between partitions in the HIVE source database and the INDEXIMA dataspace. This command will trigger a data load to keep both tables synchronized.

If a HIVE source partition is missing in Indexima, Indexima Data Engine automatically loads such partition into the appropriate table. The synchronize command includes a commit command:

SQL
SYNCHRONIZE my_data_space

Synchronize command is not auto-sync

Indexima doesn't follow/listen to changes in Hive Databases. 

This means the SYNCHRONIZE commands must be sent after every modification in the Hive datasource that needs to be replicated in Indexima.

Synchronizing selected partitions from Hive source

Prerequisite: In your core-site.xml, the warehouse URI for the property fs.defaultFS must be the same warehouse as the Hive's one.

INDEXIMA allows synchronizing only selected partitions of a HIVE source created with a set of partition into an INDEXIMA dataspace

The HIVE dataspace/table must already be partitioned: it must be created with the keyword PARTITIONED BY as shown below:

Below are some examples, based on 3 columns partitions, namely year, month, day:

SQL
CREATE TABLE my_hiveTable (
    id           INT,
    name         STRING,
    length         STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
...;
  • Check if new rows need to be synchronized in partitions corresponding to the date of 2nd April 2016:
SQL
SYNCHRONIZE my_data_space PARTITION (year='2016', month='4', day='2')
  • Check if new rows need to be synchronized in partitions corresponding to the month of May 2016:
SQL
SYNCHRONIZE my_data_space PARTITION (year='2016', month='5')
  • Check if new rows need to be synchronized in partitions corresponding to the year 2016:
SQL
SYNCHRONIZE my_data_space PARTITION (year='2016')

Synchronizing with transactional/ACID tables

If the Hive table is a transactional/ACID table, Indexima ignores all changes not committed.
The Hive base files will be synchronized with the base and delta files will be ignored.

NO_CLEANUP property

Synchronize commands in Hive temporarily generate 2 base directories.

In order to properly synchronize in Indexima, it is recommended to set NO_CLEANUP to false in the table properties as shown in the following command to enable the table-level cleanup and in order to clean obsolete directories/files.

CODE
ALTER TABLE <tablename> SET TBLPROPERTIES('NO_CLEANUP'=FALSE);

If the user wants to be sure there is only one base directory per partition, he can check with the following command.

hdfs dfs -ls -R /path/myschema.db/mytable | grep dr | grep base | wc -l

AND WAIT compaction option

In the case of partition modifications in HIVE, it is required to compact (with option=major) in order to get all modifications in the base file so that the Synchronize command in Indexima will work.

We recommend adding the option "AND WAIT" that blocks the operation until compaction completes.

See hive documentation for more details.

Synchronizing the Hive data source via Zookeeper

Hortonwork Hadoop ecosystem requires some additional jar libraries to be fully functional with Zookeeper:

Apply the following procedure to adapt INDEXIMA to such an environment:

Download the following jar to comply with Hortonworks ecosystem and unzip it: Hive thrift driver

  1. Once started, you can synchronize a HIVE data source pointed by ZooKeeper.Stop your INDEXIMA CLUSTER

    cd galactica
    ./stop.sh
    

    Run multiple times the script stop.sh to ensure that there is no more connection

  2. Remove 2 jars from galactica/lib directory

    cd galactica/lib/
    rm hive-service-thrift-1.2.1.*.jar
    rm hive-jdbc-1.2.1.jar
    
  3. Copy the 2 new jars in galactica/lib directory

    cd galactica/lib/
    cp $HOME/hive-service-thrift-HW-2.6.5.jar .
    cp  /usr/hdp/current/hive-client/lib/hive-jdbc.jar .
    
  4. Restart your INDEXIMA CLUSTER

    cd galactica
    nohup ./start-yarn.sh &
    yarn application -list
    
SQL
DROP TABLE IF EXISTS testzookeper; 
CREATE DIMENSION TABLE testzookeeper FROM sourcehive 
IN 'jdbc:hive2://ns3615.co:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2' ();
synchronize testzookeeper;
JavaScript errors detected

Please note, these errors can depend on your browser setup.

If this problem persists, please contact our support.