Install Indexima Engine

Once the general requirements are fulfilled, you can install and configure the Indexima engine. There are two ways of installing Indexima:

Manually, by unzipping the archive downloaded in step 1: Downloading Indexima
With Ansible. We provide our own Ansible roles and playbook, which can be downloaded from: https://bitbucket.org/indexima/ansible-indexima-install.

This page will cover the manual installation and configuration of Indexima.

This page assumes an installation on a 2-node Linux cluster, and that you have created a /opt/indexima folder to serve as the base directory. If your setup differs, you will need to adapt some commands to fit your environment.

Unzip Indexima archive

In step one, you should have downloaded the archive named indexima-installer-galactica-hadoop3-<version>.zip. We assume that you downloaded it to /opt/indexima.

Unzip the archive

BASH

cd /opt/indexima
unzip indexima-installer-galactica-hadoop3-<version>.zip

Replace <version> with the version number corresponding to your installation.

This will create a "galactica" directory with everything needed in it.

Create configuration files from templates

The Indexima configuration is located in /opt/indexima/galactica/conf. We provide a template for every configuration file required to run Indexima. We will now use those templates to create the initial configuration.

BASH

cd /opt/indexima/galactica/conf
cp galactica.conf.template galactica.conf
cp galactica-env.sh.template galactica-env.sh
cp hive-site.xml.template hive-site.xml
cp log4j2.xml.template log4j2.xml
cp supported_dbs.json.template supported_dbs.json
cp notifications.json.template notifications.json
cp optimize_index.json.template optimize_index.json
cp atlas-application.properties.template atlas-application.properties
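Because each template differs from its target file only by the .template suffix, the copies above can also be done in a loop. The following is a minimal sketch, not a script shipped with Indexima: the function name init_conf_from_templates is hypothetical, and it deliberately skips files that already exist so that an edited configuration is never overwritten.

```shell
# Hypothetical helper (not part of Indexima): create each configuration
# file from its *.template counterpart inside the given directory.
init_conf_from_templates() {
    local conf_dir="$1" template target
    for template in "$conf_dir"/*.template; do
        [ -e "$template" ] || continue      # the glob matched nothing
        target="${template%.template}"      # strip the .template suffix
        if [ -e "$target" ]; then
            echo "skip (already exists): $target"
        else
            cp "$template" "$target"
            echo "created: $target"
        fi
    done
}

# Usage, assuming the layout described on this page:
# init_conf_from_templates /opt/indexima/galactica/conf
```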

Configure environment

Edit the file galactica-env.sh with your favorite text editor to adapt the Java environment variables.

JAVA_HOME: If Java is not already in your PATH, uncomment JAVA_HOME variable and fill it with the path of your JAVA_HOME. Depending on your platform, JAVA_HOME could be for example /usr/lib/jvm/<java-version>.

GALACTICA_MEM: This is the memory that will be allocated to the Java virtual machine. For best performance, we suggest setting this value to 70% of the machine's total RAM. In a YARN context, this value must be no more than 80% of the YARN-assigned memory.
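As an illustration of the 70% guideline, the sketch below derives a candidate GALACTICA_MEM value from the machine's total RAM. It assumes a Linux host that exposes MemTotal (in kB) in /proc/meminfo; the function name suggest_galactica_mem is hypothetical, and the result is only a starting point to review before use.

```shell
# Hypothetical helper: print 70% of a RAM size given in kB, expressed in
# megabytes with the "m" suffix used for GALACTICA_MEM on this page.
suggest_galactica_mem() {
    local total_kb="$1"
    echo "$(( total_kb / 1024 * 70 / 100 ))m"
}

# On Linux, MemTotal in /proc/meminfo reports the machine's RAM in kB.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo "Suggested: export GALACTICA_MEM=$(suggest_galactica_mem "$total_kb")"
fi
```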

Standalone Deployment

HADOOP_BASE: If you unzipped the Hadoop libraries in /opt/indexima as described earlier and used version 2.8.3, the value of this parameter must be /opt/indexima/hadoop-2.8.3. If you downloaded another version or unzipped the archive in another location, you need to adapt the variable accordingly.

The three variables together should look like this:

galactica-env.sh

BASH

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export GALACTICA_MEM=8000m
export HADOOP_BASE=/opt/indexima/hadoop-2.8.3

Yarn Deployment

When YARN is installed on the machine, you can comment out the two variables HADOOP_BASE and HADOOP_JARS. Further down in the file, uncomment the other HADOOP_JARS variable. It should look like this:

BASH

export HADOOP_JARS=$(yarn classpath)

This means that Indexima will use the same classpath as YARN for its Hadoop dependencies.

The three variables together should look something like this:

galactica-env.sh

BASH

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export GALACTICA_MEM=8000m
export HADOOP_JARS=$(yarn classpath)

Copy additional hadoop libraries

If you are deploying Indexima with a Hadoop 3.x version, you will need to copy the Tez libraries described in the general requirements to the galactica/tez folder. Without those libraries, Indexima startup will be much slower and the logs will contain error messages about the missing Tez libraries.

BASH

mkdir -p /opt/indexima/galactica/tez
cd /opt/indexima/galactica/tez
curl -O https://repo1.maven.org/maven2/org/apache/tez/tez-api/0.9.2/tez-api-0.9.2.jar
curl -O https://repo1.maven.org/maven2/org/apache/tez/tez-dag/0.9.2/tez-dag-0.9.2.jar

Configure Indexima core engine

Edit the file galactica.conf to configure the Indexima core engine. Note that most Indexima parameters are dynamic and can be set directly in the Indexima console with a command (see galactica.conf for details about available parameters). We will customize only the static parameters in galactica.conf.

nodes parameters

The nodes parameters are used to define how Indexima nodes start and discover other Indexima nodes. Those parameters are valid for all Indexima deployments (Standalone, Kubernetes or Yarn).

nodes.requested: Number of expected nodes. The cluster will start instantly if this number of nodes is attached, without waiting for more nodes to join.
nodes.connect.min-nodes: Minimum number of nodes that must be attached before the cluster starts.
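To double-check these two values after editing, the sketch below reads them back from a configuration file and compares them. It assumes galactica.conf uses the "key = value" line format seen elsewhere on this page. The helper names conf_get and check_nodes are hypothetical, and the rule that the minimum should not exceed the requested count is a sanity check assumed here, not a documented Indexima validation.

```shell
# Hypothetical helper: print the value of a "key = value" line in a
# properties-style file (spaces around "=" are optional).
conf_get() {
    local key="$1" file="$2"
    awk -F= -v k="$key" '
        { name = $1; gsub(/[ \t]/, "", name) }
        name == k {
            val = substr($0, index($0, "=") + 1)   # everything after first "="
            gsub(/^[ \t]+|[ \t]+$/, "", val)       # trim surrounding blanks
            print val
            exit
        }
    ' "$file"
}

# Hypothetical check: fail when min-nodes is larger than nodes.requested.
check_nodes() {
    local file="$1" requested min
    requested=$(conf_get nodes.requested "$file")
    min=$(conf_get nodes.connect.min-nodes "$file")
    if [ -z "$requested" ] || [ -z "$min" ]; then
        echo "warning: node parameters not found in $file" >&2
        return 1
    fi
    if [ "$min" -le "$requested" ]; then
        echo "ok: min-nodes=$min requested=$requested"
    else
        echo "warning: min-nodes ($min) exceeds nodes.requested ($requested)" >&2
        return 1
    fi
}

# Usage, assuming the layout described on this page:
# check_nodes /opt/indexima/galactica/conf/galactica.conf
```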

With a standard dynamic setup, the cluster is launched with the command ./start-node.sh --attach xxx --host xxx

With a YARN dynamic setup, the cluster is launched with the command ./start-yarn.sh indexima

The static setup (listing the nodes directly in galactica.conf) is no longer supported as of 2022.1.

warehouse parameters

The warehouse parameters contain the path to the shared filesystem used by Indexima nodes to store data. The configuration options are described in detail in Setup shared storage.

Standalone Deployment

Copy everything to other nodes: To start Indexima, every step above must be repeated identically on the other node(s) of the cluster. One way to do this is to run an rsync command once everything is set up properly on the master node.
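A possible rsync invocation is sketched below. It assumes passwordless SSH from the master node to each target, the same /opt/indexima path on every node, and placeholder hostnames; the function name sync_indexima is hypothetical.

```shell
# Hypothetical helper: mirror the local Indexima base directory to each
# host given as an argument.
sync_indexima() {
    local base="/opt/indexima" host
    for host in "$@"; do
        # -a preserves permissions and timestamps, -z compresses in transit.
        # The trailing slashes copy the directory contents, not the directory.
        rsync -az "$base/" "$host:$base/" || return 1
    done
}

# Usage, with a placeholder hostname for the second node:
# sync_indexima node2.example.internal
```

Adding rsync's --dry-run option on a first run shows what would be transferred without copying anything.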

You can read our documentation about Ansible deployment for more advanced cases.

YARN parameters

You need to specify a few more parameters for a YARN deployment.

BASH

yarn.resourcemanager.hostname = myresourcemanager.intra
yarn.memory = 10000
yarn.dir = hdfs://myhdfs.intra:8020/tmp/indexima
yarn.name = Indexima

yarn.resourcemanager.hostname is the hostname of your YARN resource manager. Do not include the port number.

yarn.memory is the RAM taken by every container. A good practice is to set this value to 10-20% more than the GALACTICA_MEM value set above in the galactica-env.sh file.

yarn.dir is the temporary HDFS folder to which the edge node (where we installed everything) pushes the files that the other nodes need to start an Indexima cluster.

yarn.name is the name of the application as it will appear in the Resource Manager UI.
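The GALACTICA_MEM section states that, in a YARN context, the value must be no more than 80% of the YARN-assigned memory. The sketch below checks that rule with integer arithmetic. It assumes that yarn.memory is that assigned memory and that both values are expressed in megabytes; the function name mem_fits_yarn is hypothetical.

```shell
# Hypothetical helper: succeed when GALACTICA_MEM (in MB) is at most 80% of
# yarn.memory (in MB). Multiplying both sides avoids floating point.
mem_fits_yarn() {
    local galactica_mb="$1" yarn_mb="$2"
    [ $(( galactica_mb * 100 )) -le $(( yarn_mb * 80 )) ]
}

# Values taken from the examples on this page:
# GALACTICA_MEM=8000m and yarn.memory = 10000.
if mem_fits_yarn 8000 10000; then
    echo "GALACTICA_MEM fits within 80% of yarn.memory"
else
    echo "GALACTICA_MEM is too large for yarn.memory" >&2
fi
```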

Hive-site

If you are using a Hadoop cluster, you probably already have a Hive server running. Since Indexima is a Hive engine, it uses the same default port, 10000. You will need to change this to avoid a conflict. In the galactica/conf folder, edit the hive-site.xml file and change the value of the property named hive.server2.thrift.port to 10001.

hive-site.xml

XML

<property>
    <name>hive.server2.thrift.port</name>
    <value>10001</value>
</property>
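To confirm the edit, the configured port can be read back from the file. This is a minimal sketch that assumes the <name> and <value> elements sit on consecutive lines, as in the excerpt above; the function name hive_thrift_port is hypothetical, and a real XML parser would be more robust.

```shell
# Hypothetical helper: print the <value> found after the
# hive.server2.thrift.port <name> line of a hive-site.xml file.
hive_thrift_port() {
    awk '
        /<name>hive\.server2\.thrift\.port<\/name>/ { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, "")     # drop everything up to the opening tag
            sub(/<\/value>.*/, "")   # drop the closing tag and what follows
            print
            exit
        }
    ' "$1"
}

# Usage, assuming the layout described on this page:
# hive_thrift_port /opt/indexima/galactica/conf/hive-site.xml
```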

Consistency Checks at Reboot

When the Indexima service starts, it runs several checks to confirm that everything needed for correct execution is in place. If the service shuts down shortly after starting, check the logs to find out what is missing from the Indexima deployment.

In particular, the following consistency checks are performed, and prevent the service from starting in case of failure:

Existence of mandatory configuration files (galactica.conf, hive-site.xml, log4j2.xml, notifications.json, optimize_index.json, supported_dbs.json)
Availability of working directories (directories listed in the configuration, for example for the logs, the shared warehouse, and temporary files)
Validity of critical configuration options (for example, the number of cores or the number of reader threads)
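Before starting the service, the first of these checks can be approximated by hand. The sketch below only tests that the mandatory configuration files listed above exist in a given directory. It does not reproduce Indexima's own checks, and the function name check_conf_files is hypothetical.

```shell
# Hypothetical pre-flight helper: report every mandatory configuration file
# that is missing from the given directory; return non-zero if any is.
check_conf_files() {
    local conf_dir="$1" missing=0 f
    for f in galactica.conf hive-site.xml log4j2.xml \
             notifications.json optimize_index.json supported_dbs.json; do
        if [ ! -f "$conf_dir/$f" ]; then
            echo "missing: $conf_dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, assuming the layout described on this page:
# check_conf_files /opt/indexima/galactica/conf && echo "all files present"
```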

Other checks that will generate logs but not prevent the system from starting:

Usage of deprecated properties or unknown properties in configuration files