Install Indexima Engine

Once the prerequisites are in place, you can install and configure the Indexima engine. There are two ways to install Indexima:

Manually, by unzipping the archive downloaded in step 1: Downloading Indexima
With Ansible. We provide our own Ansible roles and playbook, which can be downloaded at https://indexima.com/download/ansible/.

This page will cover the manual installation and configuration of Indexima.

For this guide, we assume an installation on a two-node Linux cluster. We also assume that you have created the /opt/indexima folder, which serves as the base directory for all the steps below.

If your setup differs, adapt the commands to fit your environment.

Unzip the archive

In step 1, you should have downloaded an archive named indexima-installer-<version>.zip. We assume that you saved it in /opt/indexima.

Unzip the archive

BASH

cd /opt/indexima
unzip indexima-installer-1.7.7.1000.1.zip

Replace the version number with the one corresponding to your installation.

This will create a "galactica" directory with everything needed in it.

Configure environment

In the galactica/conf folder, you will find a galactica-env.sh.template file. Rename it to remove the .template extension, then open it with your favorite text editor.

BASH

cd /opt/indexima/galactica/conf
mv galactica-env.sh.template galactica-env.sh
vi galactica-env.sh

Let's go through the options, starting from the top of the file.

First, we have the JAVA_HOME variable. If Java is not in your PATH, uncomment this line and set it to the path of your Java installation. Most of the time, the value looks like JAVA_HOME=/usr/lib/jvm/<java-version>.

Next, you have a line with export GALACTICA_MEM=8000m. This is the memory that will be allocated to the JVM. We suggest using 70% of the machine's total RAM.
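As a rough illustration of the 70% rule, the sketch below derives a candidate value from the machine's total RAM. The function name suggest_galactica_mem is hypothetical, and the sketch assumes that the value is expressed in megabytes (as the 8000 example suggests) and that /proc/meminfo is available, which is the case on Linux.

```shell
# Print 70% of total RAM, in MB.
# Argument: total RAM in kB. Defaults to MemTotal from /proc/meminfo (Linux only).
suggest_galactica_mem() {
    total_kb="${1:-$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)}"
    echo $(( total_kb * 70 / 100 / 1024 ))
}

# Example (not run here): print the suggestion for the current machine.
#   suggest_galactica_mem
```

The result is only a starting point. Leave more headroom if other services share the machine.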

Standalone Deployment

Then we have the HADOOP_BASE variable. If you unzipped the Hadoop libraries in /opt/indexima as described earlier and used version 2.8.3, the value of this parameter must be /opt/indexima/hadoop-2.8.3. If you downloaded another version or unzipped the archive in another location, adapt the variable accordingly.

The three variables together should look something like this

galactica-env.sh

BASH

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export GALACTICA_MEM=8000m
export HADOOP_BASE=/opt/indexima/hadoop-2.8.3
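Before going further, it can help to confirm that the edited file defines what a standalone deployment needs. The following is a minimal sketch, not an Indexima tool: the function name check_standalone_env is hypothetical. It only checks that GALACTICA_MEM is set and that HADOOP_BASE points to an existing directory.

```shell
# Source the environment file in a subshell (so the current shell is untouched)
# and report the first problem found.
check_standalone_env() {
    env_file="$1"
    (
        . "$env_file" || exit 1
        if [ -z "${GALACTICA_MEM:-}" ]; then
            echo "GALACTICA_MEM is not set"
            exit 1
        fi
        if [ ! -d "${HADOOP_BASE:-}" ]; then
            echo "HADOOP_BASE is not a directory: ${HADOOP_BASE:-<unset>}"
            exit 1
        fi
        echo "ok"
    )
}

# Example (not run here):
#   check_standalone_env /opt/indexima/galactica/conf/galactica-env.sh
```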

YARN Deployment

When YARN is installed on the machine, you can comment out the two variables HADOOP_BASE and HADOOP_JARS. Further down, uncomment the other HADOOP_JARS variable. It should look like this:

BASH

export HADOOP_JARS=$(yarn classpath)

This means that Indexima will use the same classpath as YARN for its Hadoop dependencies.

The three variables together should look something like this

galactica-env.sh

BASH

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export GALACTICA_MEM=8000m
export HADOOP_JARS=$(yarn classpath)

Configure Galactica

Still inside the galactica/conf folder, there is a file named galactica.conf.template. As with the environment file, rename it without the extension, and open it.

BASH

cd /opt/indexima/galactica/conf
mv galactica.conf.template galactica.conf
vi galactica.conf

There are a handful of parameters to change in order to start our two-node cluster. We will only cover the necessary ones. For more information about all the parameters and their uses, see the Reference part of the documentation.

nodes

This is a comma-separated list of the hostnames or IP addresses of the machines in your cluster. For example, if your two nodes have the IP addresses 10.0.0.1 and 10.0.0.2 respectively, the parameter will be:

BASH

nodes = 10.0.0.1, 10.0.0.2
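For larger clusters, the line can be generated rather than typed by hand. The sketch below is illustrative only: nodes_line is a hypothetical helper that joins its arguments in the format shown above.

```shell
# Build the "nodes = ..." line from a list of hostnames or IP addresses.
nodes_line() {
    joined=""
    for host in "$@"; do
        if [ -z "$joined" ]; then
            joined="$host"
        else
            joined="$joined, $host"
        fi
    done
    printf 'nodes = %s\n' "$joined"
}

nodes_line 10.0.0.1 10.0.0.2
```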

warehouse.protocol

This is the type of warehouse you are using to store Indexima data. For a full description of the different warehouses, you can look into the Warehouse part of the documentation.

Standalone Deployment

For the scope of this guide, we will use the LOCAL warehouse protocol. This means that the data will be split in two, and each node will be the sole owner of its part of the data.

This setting is not recommended in production, as it provides no fault tolerance.

BASH

warehouse.protocol = LOCAL

YARN Deployment

Since you are using a YARN deployment, we assume that you want to use the HDFS of your Hadoop cluster. If this is not the case, follow the Standalone Deployment instructions for this part only.

BASH

warehouse.protocol = HDFS

warehouse

This is where the data will be stored by Indexima.

Standalone Deployment

For a Standalone deployment, you can choose any path that is accessible on the machine and writable by the user that will start Indexima. For example:

BASH

warehouse = /opt/indexima/warehouse
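A quick way to confirm that the chosen path meets both conditions is sketched below. The function name check_warehouse is hypothetical. Run it as the user that will start Indexima so that the write test is meaningful.

```shell
# Report whether a local warehouse path exists and is writable by the current user.
check_warehouse() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "not writable: $dir"
        return 1
    fi
    echo "ok: $dir"
}

# Example (not run here):
#   mkdir -p /opt/indexima/warehouse && check_warehouse /opt/indexima/warehouse
```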

Copy everything to the other node:

In order to start Indexima, everything that we did must be reproduced in exactly the same way on the other node(s) of the cluster. One way to do this is to run an rsync command once everything is set up properly on the master node.

You can read our documentation about Ansible deployment for more advanced cases.
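The rsync step mentioned above could look like the following sketch. The function name sync_indexima and the DRY_RUN switch are illustrative and not part of Indexima. The sketch assumes SSH access from the master node to the target node, and that the installation lives at the same path on both machines.

```shell
# Mirror a directory to the same path on another node.
# -a preserves permissions and timestamps, -z compresses data in transit.
# With DRY_RUN=1, the command is printed instead of executed.
sync_indexima() {
    src="${1%/}"
    host="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "rsync -az $src/ $host:$src/"
    else
        rsync -az "$src/" "$host:$src/"
    fi
}

# Preview the command for the second node of the example cluster.
DRY_RUN=1 sync_indexima /opt/indexima 10.0.0.2
```

The trailing slashes make rsync copy the contents of the directory rather than nesting a second indexima folder inside the target.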

YARN Deployment

For a YARN deployment, since we assume that you want to use HDFS, the warehouse must be the full HDFS path to a location where the user starting Indexima has write access. For example:

BASH

warehouse = hdfs://myhdfs.intra:8020/apps/indexima/warehouse

YARN parameters

You need to specify a few more parameters for a YARN deployment.

BASH

yarn.resourcemanager.hostname = myresourcemanager.intra
yarn.memory = 10000
yarn.dir = hdfs://myhdfs.intra:8020/tmp/indexima
yarn.name = Indexima

yarn.resourcemanager.hostname is the hostname of your YARN resource manager. Do not include the port number.

yarn.memory is the RAM allocated to each container. A good practice is to set this value 10-20% higher than the GALACTICA_MEM value set above in the galactica-env.sh file.

yarn.dir is the temporary HDFS folder to which the edge node (the machine where we installed everything) pushes everything the other nodes need in order to start an Indexima cluster.

yarn.name is the name of the application as it will appear in the Resource Manager UI.
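The sizing guideline for yarn.memory is simple arithmetic, sketched below. The function name yarn_memory_for is hypothetical, the 15% default is just a midpoint of the suggested 10-20% range, and both values are assumed to be in megabytes.

```shell
# Compute a yarn.memory value from GALACTICA_MEM (in MB) plus a percentage overhead.
# Second argument: overhead in percent, defaults to 15.
yarn_memory_for() {
    galactica_mem_mb="$1"
    overhead_pct="${2:-15}"
    echo $(( galactica_mem_mb * (100 + overhead_pct) / 100 ))
}

yarn_memory_for 8000 20
```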

Hive-site

If you are using a Hadoop cluster, you probably already have a Hive server running. Since Indexima is a Hive engine, it uses the same default port, 10000. You will need to change this port to avoid a conflict. In the galactica/conf folder, open the hive-site.xml file.

hive-site

BASH

cd /opt/indexima/galactica/conf
vi hive-site.xml

At the top of the file, you will find an XML property named hive.server2.thrift.port. Change the value of this property to 10001.

hive-site.xml

XML

<property>
    <name>hive.server2.thrift.port</name>
    <value>10001</value>
</property>
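If you prefer to script this change, for instance to apply it on several nodes, something like the following sketch can work. The function name set_thrift_port is hypothetical. The sketch assumes that the value line directly follows the name line, as in the snippet above. Check the file afterwards, since a differently formatted hive-site.xml would be left unchanged.

```shell
# Rewrite the value of hive.server2.thrift.port in a hive-site.xml file.
# Assumes the <value> line immediately follows the <name> line.
set_thrift_port() {
    file="$1"
    port="$2"
    tmp="$(mktemp)" || return 1
    sed '/<name>hive\.server2\.thrift\.port<\/name>/{
n
s|<value>[0-9]*</value>|<value>'"$port"'</value>|
}' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back through the original file to keep its permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example (not run here):
#   set_thrift_port /opt/indexima/galactica/conf/hive-site.xml 10001
```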

Consistency Checks at Reboot (since 2021.2)

When the Indexima service starts, the system runs several checks to verify that everything needed for the service to run correctly is in place. If the service shuts down shortly after starting, check the logs to find out what is missing in the Indexima deployment.

In particular, the following consistency checks are performed. A failure in any of them prevents the service from starting:

Existence of mandatory configuration files (galactica.conf, hive-site.xml, log4j2.xml, notifications.json, optimize_index.json, supported_dbs.json)
Availability of working directories (directories listed in the configuration, for example for the logs, the shared warehouse, or temporary files)
Validity of critical configuration options (for example, the number of cores or the number of reader threads)

Other checks generate log messages but do not prevent the system from starting:

Usage of deprecated properties or unknown properties in configuration files
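You can run a similar pre-flight check yourself before starting the service. The sketch below is not Indexima's own check: missing_conf_files is a hypothetical helper that looks for the mandatory files listed above, assuming they all live in a single configuration directory such as galactica/conf.

```shell
# Print the name of each mandatory configuration file missing from a directory.
# Prints nothing when all files are present.
missing_conf_files() {
    conf_dir="$1"
    for name in galactica.conf hive-site.xml log4j2.xml \
                notifications.json optimize_index.json supported_dbs.json; do
        [ -f "$conf_dir/$name" ] || echo "$name"
    done
}

# Example (not run here):
#   missing_conf_files /opt/indexima/galactica/conf
```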