Once the prerequisites are fulfilled, you can install and configure the Indexima engine. There are two ways of installing Indexima:

  1. Manually, by unzipping the archive downloaded in step 1: Downloading Indexima
  2. With Ansible. We provide our own Ansible roles and playbook, which can be downloaded at: https://github.com/indexima-dev/ansible-indexima-install/tags.

This page will cover the manual installation and configuration of Indexima.

Let's assume an installation on a 2-node cluster running Linux. We also assume that you created a /opt/indexima folder, which will be the base directory we work in. If your setup differs, you will need to adapt some commands to fit your environment.

Unzip Indexima archive

In step 1, you should have downloaded an archive named indexima-installer-galactica-hadoop3-<version>.zip. We assume that you downloaded it to /opt/indexima.

Unzip the archive

cd /opt/indexima
unzip indexima-installer-galactica-hadoop3-<version>.zip

Replace <version> with the version number corresponding to your installation.

This will create a "galactica" directory with everything needed in it.


Create configuration files from templates

Indexima configuration is available in /opt/indexima/galactica/conf. We provide a template for every configuration file required to run Indexima. We will now use those templates to initialize the configuration.

cd /opt/indexima/galactica/conf
cp galactica.conf.template galactica.conf
cp galactica-env.sh.template galactica-env.sh
cp hive-site.xml.template hive-site.xml
cp log4j2.xml.template log4j2.xml
cp supported_dbs.json.template supported_dbs.json
cp notifications.json.template notifications.json
cp optimize_index.json.template optimize_index.json
cp atlas-application.properties.template atlas-application.properties
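The copy commands above can also be written as a loop. The following is a minimal sketch, assuming every template in the configuration directory ends in .template; the function name init_conf_from_templates is hypothetical and not part of Indexima. It skips files that already exist, so rerunning it does not overwrite a configuration you have already edited.

```shell
# Copy every *.template file in a directory to its final name.
# Existing target files are left untouched.
# NOTE: init_conf_from_templates is an illustrative helper, not an Indexima tool.
init_conf_from_templates() {
    local conf_dir="$1"
    local template target
    for template in "$conf_dir"/*.template; do
        [ -e "$template" ] || continue      # the glob matched nothing
        target="${template%.template}"      # strip the .template suffix
        if [ -e "$target" ]; then
            echo "skip: $target already exists"
        else
            cp "$template" "$target"
            echo "created: $target"
        fi
    done
}
```

With the layout used on this page, it would be called as init_conf_from_templates /opt/indexima/galactica/conf.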

Configure environment

Edit the file galactica-env.sh with your favorite text editor to adapt the Java environment variables.

JAVA_HOME: If Java is not already in your PATH, uncomment the JAVA_HOME variable and set it to the path of your Java installation. Depending on your platform, this could be, for example, /usr/lib/jvm/<java-version>.

GALACTICA_MEM: This is the memory allocated to the Java virtual machine. For best performance, we suggest setting this value to 70% of the machine's total RAM. In a YARN context, this value must be no more than 80% of the YARN-assigned memory.
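As a rough illustration of the 70% guideline, the sketch below derives a candidate GALACTICA_MEM value from the machine's total RAM. It assumes a Linux host where /proc/meminfo reports MemTotal in kB, and that the m suffix denotes megabytes as in the 8000m example on this page. The function name suggest_galactica_mem is hypothetical.

```shell
# Convert a total-memory figure in kB into 70% of that amount,
# expressed in megabytes with an "m" suffix.
# NOTE: suggest_galactica_mem is an illustrative helper, not an Indexima tool.
suggest_galactica_mem() {
    local total_kb="$1"
    echo "$(( total_kb * 70 / 100 / 1024 ))m"
}

# On Linux, read the total RAM from /proc/meminfo (value is in kB).
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    if [ -n "$total_kb" ]; then
        echo "Suggested GALACTICA_MEM: $(suggest_galactica_mem "$total_kb")"
    fi
fi
```

Treat the printed value as a starting point only; other services running on the same machine also need memory.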


Standalone Deployment

HADOOP_BASE: If you unzipped the Hadoop libraries in /opt/indexima as described earlier and used version 2.8.3, the value of this parameter must be /opt/indexima/hadoop-2.8.3. If you downloaded another version or unzipped the archive in another location, you need to adapt the variable accordingly.

The three variables together should look like this:

galactica-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export GALACTICA_MEM=8000m
export HADOOP_BASE=/opt/indexima/hadoop-2.8.3

Yarn Deployment

When YARN is installed on the machine, you can comment out the two variables HADOOP_BASE and HADOOP_JARS. Further down, uncomment the other HADOOP_JARS variable. It should look like this:

export HADOOP_JARS=$(yarn classpath)

This means that Indexima will use the same classpath as YARN for its Hadoop dependencies.

The three variables together should look something like this:

galactica-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export GALACTICA_MEM=8000m
export HADOOP_JARS=$(yarn classpath)

Configure Indexima core engine

Edit the file galactica.conf to configure the Indexima core engine. Note that most Indexima parameters are dynamic and can be set directly in the Indexima console with a command (see galactica.conf for details about available parameters). Here we customize only the static parameters in galactica.conf.

nodes parameters

The nodes parameters are used to define how Indexima nodes start and discover other Indexima nodes.

Standard dynamic setup

The cluster is launched with the command ./start-node.sh --attach xxx --host xxx

  • nodes.requested=Number of expected nodes. The cluster will start instantly if this number of nodes is attached, without waiting for more nodes to join.
  • nodes.connect.min-nodes=Minimum number of nodes that must be attached before the cluster starts.
  • nodes=[This parameter is unused in this setup]

Yarn dynamic setup

The cluster is launched with the command ./start-yarn.sh indexima

  • nodes.requested=Number of expected nodes. The cluster will start instantly if this number of nodes is attached, without waiting for more nodes to join.
  • nodes.connect.min-nodes=Minimum number of nodes that must be attached before the cluster starts.
  • nodes=[This parameter is unused in this setup]

Yarn static setup (deprecated)

The cluster is launched with the command ./start-yarn.sh indexima

  • nodes=List of nodes hostname/ip in the cluster. The first one is elected primary master node.
  • nodes.requested=[This parameter is unused in this setup]
  • nodes.connect.min-nodes=[This parameter is unused in this setup]


All Static setups have been deprecated.
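To double-check which node settings are in effect before starting the cluster, a small helper can read them back from galactica.conf. This is a sketch only: it assumes the file uses one "key = value" pair per line, as in the YARN example on this page, and the function name conf_get is hypothetical.

```shell
# Print the value of a "key = value" setting from a configuration file.
# Only the first matching line is reported.
# NOTE: conf_get is an illustrative helper, not an Indexima tool.
conf_get() {
    local key="$1" file="$2"
    awk -F'=' -v k="$key" '
        {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)        # trim the key
        }
        name == k {
            value = substr($0, index($0, "=") + 1)   # text after the first "="
            gsub(/^[ \t]+|[ \t]+$/, "", value)       # trim the value
            print value
            exit
        }' "$file"
}
```

For example, conf_get nodes.requested /opt/indexima/galactica/conf/galactica.conf prints the number of expected nodes, and the same call with nodes.connect.min-nodes prints the minimum required before the cluster starts.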

warehouse.protocol

This is the kind of warehouse (shared storage between nodes) used to store Indexima data. For a full description of the different warehouses, see the Storage Compatibility Matrix.

The warehouse.protocol can be HDFS (if a shared filesystem is available to store Indexima data), S3 (if an S3 bucket is available to store Indexima data), or LOCAL for any other shared filesystem. When using LOCAL, the shared filesystem (NFS, CEPH...) must be mounted to the path provided in the warehouse parameter.
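When using LOCAL, it can help to confirm on each node that the warehouse path really sits on the shared filesystem and not on a local disk. The sketch below prints the device (or NFS export) and mount point backing a given directory, using the POSIX output format of df. The function name backing_filesystem and the path in the usage note are illustrative assumptions.

```shell
# Print which filesystem backs a directory, and where it is mounted.
# NOTE: backing_filesystem is an illustrative helper, not an Indexima tool.
backing_filesystem() {
    # In "df -P" output, column 1 is the source and column 6 the mount point.
    df -P "$1" | awk 'NR == 2 { print $1 " mounted on " $6 }'
}
```

Running backing_filesystem /mnt/indexima-warehouse (a hypothetical warehouse path) on every node should report the same shared source everywhere.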

warehouse parameters

The warehouse parameters contain the path to the shared filesystem used by Indexima nodes to store data. The configuration options are described in detail in Setup shared storage.

Standalone Deployment

Copy everything to other nodes: To start Indexima, everything we did must be reproduced in exactly the same way on the other node(s) of the cluster. One way to do this is to run an rsync command once everything is set up properly on the master node.

You can read our documentation about Ansible deployment for more advanced cases.
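The rsync step can be sketched as a loop over the other nodes. This example assumes passwordless SSH from the master node to each target and the same /opt/indexima path everywhere. The function name sync_to_nodes, the DRY_RUN switch, and the host name node2 are hypothetical.

```shell
# Mirror a directory to the same path on each remote host.
# With DRY_RUN=1 the rsync commands are printed instead of executed.
# NOTE: sync_to_nodes is an illustrative helper, not an Indexima tool.
sync_to_nodes() {
    local src="$1"
    shift
    local host
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "rsync -a $src/ $host:$src/"
        else
            # -a preserves permissions, timestamps and symlinks
            rsync -a "$src/" "$host:$src/"
        fi
    done
}
```

Setting DRY_RUN=1 and calling sync_to_nodes /opt/indexima node2 shows what would be copied before anything is transferred.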

Yarn Deployment

You need to specify a few more parameters for a YARN deployment.

yarn.resourcemanager.hostname = myresourcemanager.intra
yarn.memory = 10000
yarn.dir = hdfs://myhdfs.intra:8020/tmp/indexima
yarn.name = Indexima

yarn.resourcemanager.hostname is the hostname of your YARN resource manager. Do not include the port number.

yarn.memory is the RAM taken by every container. A good practice is to set this value 10-20% higher than the GALACTICA_MEM value set above in the galactica-env.sh file.
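The headroom calculation is simple arithmetic, sketched here with the headroom percentage as a parameter. It assumes yarn.memory is expressed in the same megabyte unit as GALACTICA_MEM, as the 8000m and 10000 examples on this page suggest. The function name yarn_memory_for is hypothetical.

```shell
# Given a GALACTICA_MEM value such as "8000m" and a headroom percentage,
# print the corresponding yarn.memory value.
# NOTE: yarn_memory_for is an illustrative helper, not an Indexima tool.
yarn_memory_for() {
    local mem_mb="${1%m}"     # strip the trailing "m"
    local headroom_pct="$2"
    echo "$(( mem_mb * (100 + headroom_pct) / 100 ))"
}

yarn_memory_for 8000m 10   # prints 8800
yarn_memory_for 8000m 20   # prints 9600
```

For reference, the example values on this page (GALACTICA_MEM=8000m with yarn.memory = 10000) correspond to a headroom of 25%.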

yarn.dir is the temporary HDFS folder to which the edge node (where we installed everything) pushes everything the other nodes need in order to start an Indexima cluster.

yarn.name is the name of the application as it will appear in the Resource Manager UI.

Hive-site

If you are using a Hadoop cluster, you probably already have a Hive server running. Since Indexima is a Hive engine, it uses the same default port, 10000. You need to change this to avoid a conflict. In the galactica/conf folder, edit the hive-site.xml file and change the value of the property named hive.server2.thrift.port to 10001.

hive-site.xml

<property>
    <name>hive.server2.thrift.port</name>
    <value>10001</value>
</property>
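After editing, you may want to confirm which port is actually configured. The sketch below extracts it from hive-site.xml, assuming the name and value elements sit on separate lines with the value following the name, as in the snippet above. The function name hive_thrift_port is hypothetical.

```shell
# Print the value configured for hive.server2.thrift.port in a hive-site.xml file.
# Assumes <name> and <value> are on separate lines, value after name.
# NOTE: hive_thrift_port is an illustrative helper, not an Indexima tool.
hive_thrift_port() {
    awk '
        /<name>hive\.server2\.thrift\.port<\/name>/ { found = 1; next }
        found && /<value>/ {
            gsub(/.*<value>|<\/value>.*/, "")   # keep only the text between the tags
            print
            exit
        }' "$1"
}
```

Calling hive_thrift_port /opt/indexima/galactica/conf/hive-site.xml should then print 10001 if the change described above was applied.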

Consistency Checks at Reboot

When starting the Indexima service, the system runs a series of checks to verify that everything is in place for the service to run correctly. If the service shuts down quickly, check the logs to find out what is missing in the Indexima deployment.

In particular, the following consistency checks are performed, and prevent the service from starting in case of failure:

  • Existence of mandatory configuration files (galactica.conf, hive-site.xml, log4j2.xml, notifications.json, optimize_index.json, supported_dbs.json)

  • Availability of working directories (directories listed in the configuration, for example for logs, the shared warehouse, and temporary files)

  • Validity of critical configuration options (for example, the number of cores or the number of reader threads)
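The first of these checks is easy to reproduce by hand before starting the service. The following sketch looks for the mandatory configuration files listed above in a given directory. It is not the check Indexima itself runs, only an approximation of it, and the function name check_mandatory_conf is hypothetical.

```shell
# Report any mandatory configuration file missing from a directory.
# Returns 0 when all files are present, 1 otherwise.
# NOTE: check_mandatory_conf is an illustrative helper, not an Indexima tool.
check_mandatory_conf() {
    local conf_dir="$1" missing=0 f
    for f in galactica.conf hive-site.xml log4j2.xml \
             notifications.json optimize_index.json supported_dbs.json; do
        if [ ! -f "$conf_dir/$f" ]; then
            echo "missing: $conf_dir/$f"
            missing=1
        fi
    done
    return "$missing"
}
```

With the layout used on this page, check_mandatory_conf /opt/indexima/galactica/conf prints nothing and returns 0 when every file is in place.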

Other checks generate log entries but do not prevent the system from starting:

  • Usage of deprecated properties or unknown properties in configuration files