When the prerequisites are done, you can install and configure the Indexima engine. There are two ways of installing Indexima:

  1. Manually, by unzipping the archive downloaded in step 1: Downloading Indexima
  2. With Ansible. We provide our own Ansible roles and playbook, which can be downloaded from https://indexima.com/download/ansible/.

This page will cover the manual installation and configuration of Indexima.

For the sake of this guide, we assume an installation on a 2-node cluster running Linux. We also assume that you created a /opt/indexima folder, which will be the base directory we work in.

If you are not in this exact configuration, you will need to adapt some commands to fit your environment.

Unzip the archive

In step 1, you should have downloaded an archive named indexima-installer-<version>.zip. We assume that you downloaded it to /opt/indexima.


cd /opt/indexima
unzip indexima-installer-1.7.7.1000.1.zip
BASH

Replace the version number with the one corresponding to your download.

This will create a "galactica" directory with everything needed in it.

Configure environment

In the galactica/conf folder, you will find a galactica-env.sh.template file. Rename it to remove the .template extension and open it with your favorite text editor.

cd /opt/indexima/galactica/conf
mv galactica-env.sh.template galactica-env.sh
vi galactica-env.sh
BASH

Starting from the top of the file, here are the options to set.

First, we have the JAVA_HOME variable. If Java is not in your PATH, you need to uncomment this line and set it to the path of your Java installation. Most of the time, this looks like JAVA_HOME=/usr/lib/jvm/<java-version>.

Next, you have a line with export GALACTICA_MEM=8000. This is the memory that will be allocated to the JVM. We suggest using 70% of the machine's total RAM.
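
As a rough aid for the 70% rule above, here is a minimal sketch that derives a candidate value from the machine's total RAM. It assumes that GALACTICA_MEM is expressed in megabytes, which this guide does not state explicitly; suggest_mem_mb is a hypothetical helper name, not part of Indexima.

```shell
#!/bin/sh
# Sketch: derive a GALACTICA_MEM candidate as 70% of total RAM.
# Assumption: GALACTICA_MEM is expressed in megabytes (not confirmed by this guide).

# suggest_mem_mb TOTAL_KB -> prints 70% of TOTAL_KB, converted to MB (integer math).
suggest_mem_mb() {
  echo $(( $1 * 70 / 100 / 1024 ))
}

# On Linux, MemTotal in /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
  total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  echo "Suggested GALACTICA_MEM: $(suggest_mem_mb "$total_kb")"
fi
```

Treat the printed number as a starting point only; leave more headroom if other services share the machine.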


Then we have the HADOOP_BASE variable. If you unzipped the Hadoop libraries in /opt/indexima as described earlier and used the version 2.8.3, the value of this parameter must be /opt/indexima/hadoop-2.8.3. If you downloaded another version or unzipped the archive in another location, you need to adapt the variable accordingly.

The three variables together should look something like this

galactica-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export GALACTICA_MEM=8000m
export HADOOP_BASE=/opt/indexima/hadoop-2.8.3
BASH

When YARN is installed on the machine, you can comment out the two variables HADOOP_BASE and HADOOP_JARS. Further down in the file, uncomment the other HADOOP_JARS variable. It should look like this:

export HADOOP_JARS=$(yarn classpath)
BASH

This means that Indexima will use the same classpath as YARN for its Hadoop dependencies.
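
Before switching to this form, it may be worth checking that the yarn command is actually available to the user that will start Indexima. The sketch below uses only the POSIX command -v builtin; have_cmd is a hypothetical helper name introduced for illustration.

```shell
#!/bin/sh
# have_cmd NAME -> succeeds if NAME resolves to a command for the current user.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd yarn; then
  echo "yarn found: HADOOP_JARS=\$(yarn classpath) can be used"
else
  echo "yarn not found: keep HADOOP_BASE and the original HADOOP_JARS line"
fi
```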

With YARN, the three variables together should look something like this

galactica-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export GALACTICA_MEM=8000m
export HADOOP_JARS=$(yarn classpath)
BASH

Configure Galactica

Still inside the galactica/conf folder, there is a file named galactica.conf.template. As with the environment file, rename it to remove the .template extension, then open it.

cd /opt/indexima/galactica/conf
mv galactica.conf.template galactica.conf
vi galactica.conf
BASH

There are a handful of parameters to change in order to start our 2-node cluster. We will only cover the necessary ones. For more information about all the parameters and their uses, see the Reference part of the documentation.

nodes

This is a comma-separated list of the hostnames or IP addresses of the machines in your cluster. For example, if your two nodes have the IP addresses 10.0.0.1 and 10.0.0.2 respectively, the parameter will be

nodes = 10.0.0.1, 10.0.0.2
BASH

warehouse.protocol

This is the type of warehouse you are using to store Indexima data. For a full description of the different warehouses, you can look into the Warehouse part of the documentation.

For the scope of this guide, we will use the LOCAL warehouse protocol. This means that the data will be split in two, and each node will be the sole owner of its part of the data.

This setting is not recommended in production, as it is absolutely not fault tolerant.

warehouse.protocol = LOCAL
BASH

For a YARN deployment, we assume that you want to use the HDFS of your Hadoop cluster. If this is not the case, follow the Standalone deployment instructions for this part.

warehouse.protocol = HDFS
BASH

warehouse

This is where the data will be stored by Indexima.

For a Standalone deployment, you can choose any path available on the machine and writable by the user that will start Indexima, e.g.

warehouse = /opt/indexima/warehouse
BASH
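
To confirm that the chosen path meets the two conditions above (present on the machine, writable by the user that starts Indexima), a small check along these lines can help. check_warehouse is a hypothetical helper name, not an Indexima command.

```shell
#!/bin/sh
# check_warehouse DIR -> creates DIR if needed, then verifies that it is
# a directory writable by the current user. Returns non-zero otherwise.
check_warehouse() {
  mkdir -p "$1" && [ -d "$1" ] && [ -w "$1" ]
}
```

Run it as the user that will start Indexima, for example: check_warehouse /opt/indexima/warehouse && echo "warehouse OK".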

Copy everything to the other node

In order to start Indexima, everything that we did needs to be done in exactly the same way on the other node(s) of the cluster. One way to do this is to run an rsync command once everything is set up properly on the master node.

You can read our documentation about Ansible deployment for more advanced cases.
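
As an illustration of the rsync approach, the sketch below prints the command that would mirror the installation to each other node. It assumes the second node is 10.0.0.2 (as in the nodes example), that SSH access works for the current user, and that it is run before the first start of Indexima. sync_command is a hypothetical helper name.

```shell
#!/bin/sh
# sync_command NODE -> prints the rsync command that mirrors /opt/indexima to NODE.
# -a preserves permissions and timestamps, -z compresses data during transfer.
sync_command() {
  echo "rsync -az /opt/indexima/ $1:/opt/indexima/"
}

# List every node except the one you are copying from (assumed here: 10.0.0.2).
for node in 10.0.0.2; do
  sync_command "$node"   # prints only; run the printed line to actually copy
done
```

Printing the command first makes it easy to review the source and destination paths before anything is transferred.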


For a YARN deployment, since we assume that you want to use HDFS, the warehouse must be the full HDFS path to a location where the user starting Indexima has write access, e.g.

warehouse = hdfs://myhdfs.intra:8020/apps/indexima/warehouse
BASH

YARN parameters

You need to specify a few more parameters for a YARN deployment.

yarn.resourcemanager.hostname = myresourcemanager.intra
yarn.memory = 10000
yarn.dir = hdfs://myhdfs.intra:8020/tmp/indexima
yarn.name = Indexima
BASH

yarn.resourcemanager.hostname is the hostname of your YARN Resource Manager. Do not include the port number.

yarn.memory is the RAM allocated to each container. A good practice is to set this value 10-20% higher than the GALACTICA_MEM value set above in the galactica-env.sh file.

yarn.dir is the temporary HDFS folder to which the edge node (the machine where we installed everything) pushes the files that the other nodes need in order to start an Indexima cluster.

yarn.name is the name of the application as it will appear in the Resource Manager UI
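
The 10-20% guideline for yarn.memory can be turned into a quick calculation. This is only a sketch: yarn_memory is a hypothetical helper name, and the 15% default is an arbitrary midpoint of the suggested range.

```shell
#!/bin/sh
# yarn_memory GALACTICA_MEM [OVERHEAD_PERCENT]
# Prints GALACTICA_MEM plus the given overhead (default: 15%, an assumed midpoint).
yarn_memory() {
  echo $(( $1 * (100 + ${2:-15}) / 100 ))
}

yarn_memory 8000      # 15% headroom
yarn_memory 8000 20   # 20% headroom
```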

Hive-site

If you are using a Hadoop cluster, you probably already have a Hive server running. Since Indexima is a Hive engine, it uses the same default port, 10000. You will need to change this port in order to avoid a conflict. In the galactica/conf folder, open the hive-site.xml file.

hive-site

cd /opt/indexima/galactica/conf
vi hive-site.xml
BASH

At the top of the file, you will find an XML property named hive.server2.thrift.port. Change the value of this property to 10001.

hive-site.xml

<property>
    <name>hive.server2.thrift.port</name>
    <value>10001</value>
</property>
XML
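
To double-check the edit, the port can be read back from the file. The sketch below assumes that the name and value elements sit on adjacent lines, as in the snippet above; thrift_port is a hypothetical helper name, and grep -A is available in the GNU and BSD versions of grep.

```shell
#!/bin/sh
# thrift_port FILE -> prints the <value> on the line following the
# hive.server2.thrift.port <name> line in FILE.
thrift_port() {
  grep -A1 '<name>hive.server2.thrift.port</name>' "$1" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}
```

For example, thrift_port /opt/indexima/galactica/conf/hive-site.xml should print the port you just set.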