Once the general requirements are fulfilled, you can install and configure the Indexima engine. There are two ways of installing Indexima:
Manually, by unzipping the archive downloaded in step 1: Downloading Indexima
With Ansible. We provide our own Ansible roles and playbook to users. They can be downloaded from: https://bitbucket.org/indexima/ansible-indexima-install.
This page will cover the manual installation and configuration of Indexima.
Let's assume an installation on a two-node Linux cluster, with /opt/indexima already created as the base directory for all the steps below. If your configuration differs, you will need to adapt some commands to fit your environment.
Unzip Indexima archive
You should have downloaded the archive named indexima-installer-galactica-hadoop3-<version>.zip in step 1. We assume that you downloaded it to /opt/indexima.
Unzip the archive
cd /opt/indexima
unzip indexima-installer-galactica-hadoop3-<version>.zip
Replace the version number with the one corresponding to your installation
This will create a "galactica" directory with everything needed in it.
Create configuration files from templates
Indexima configuration is located in /opt/indexima/galactica/conf. We provide a template for every configuration file required to run Indexima. We will now use those templates to initialize the configuration.
cd /opt/indexima/galactica/conf
cp galactica.conf.template galactica.conf
cp galactica-env.sh.template galactica-env.sh
cp hive-site.xml.template hive-site.xml
cp log4j2.xml.template log4j2.xml
cp supported_dbs.json.template supported_dbs.json
cp notifications.json.template notifications.json
cp optimize_index.json.template optimize_index.json
cp atlas-application.properties.template atlas-application.properties
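The copy step above can also be written as a loop. This is a minimal sketch, not part of the Indexima distribution: the function name copy_templates is hypothetical, and it copies every file ending in .template that it finds, which may be more than the eight files listed above. It skips targets that already exist so that a configuration file you have already edited is not overwritten.

```shell
# Copy every *.template file in a directory to its final name.
# Existing targets are left untouched so local edits are preserved.
copy_templates() {
  conf_dir=$1
  for template in "$conf_dir"/*.template; do
    [ -e "$template" ] || continue        # glob matched nothing
    target=${template%.template}          # strip the .template suffix
    if [ -e "$target" ]; then
      echo "skip $target (already exists)"
    else
      cp "$template" "$target"
    fi
  done
}
```

For the layout assumed on this page, it would be called as copy_templates /opt/indexima/galactica/conf.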
Edit the file galactica-env.sh with your favorite text editor to adapt the Java environment variables.
JAVA_HOME: If Java is not already in your PATH, uncomment the JAVA_HOME variable and set it to the path of your Java installation. Depending on your platform, this could be, for example, /usr/lib/jvm/<java-version>.
GALACTICA_MEM: This is the memory that will be allocated to the java virtual machine. We suggest setting this value to 70% of the machine's total RAM, for best performance. In a Yarn context, this value must be no more than 80% of the Yarn assigned memory.
HADOOP_BASE: If you unzipped the Hadoop libraries in /opt/indexima as described earlier and used version 2.8.3, the value of this parameter must be /opt/indexima/hadoop-2.8.3. If you downloaded another version or unzipped the archive in another location, you need to adapt the variable accordingly.
The three variables together should look like this:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export GALACTICA_MEM=8000m
export HADOOP_BASE=/opt/indexima/hadoop-2.8.3
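To apply the 70% guideline for GALACTICA_MEM, the value can be derived from the machine's total RAM. The following is a rough sketch under stated assumptions: the function names are hypothetical, the machine runs Linux and exposes MemTotal (in kB) in /proc/meminfo, and the result is written in megabytes with the same m suffix as the example above.

```shell
# Suggest a GALACTICA_MEM value equal to 70% of total RAM.
# Argument: total memory in kB (the unit used by MemTotal in /proc/meminfo).
suggest_galactica_mem() {
  total_kb=$1
  echo "$(( total_kb / 1024 * 70 / 100 ))m"
}

# Read the total RAM in kB from /proc/meminfo (Linux only).
read_total_kb() {
  awk '/^MemTotal:/ {print $2}' /proc/meminfo
}
```

Running suggest_galactica_mem "$(read_total_kb)" prints a value that can be copied into galactica-env.sh. Treat it as a starting point and lower it if other services share the machine.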
When YARN is installed on the machine, you can comment out the two variables HADOOP_BASE and HADOOP_JARS, then uncomment the other HADOOP_JARS variable further down in the file. It should look like this:
export HADOOP_JARS=$(yarn classpath)
This means that Indexima will use the same classpath as YARN for its Hadoop dependencies.
The three variables together should look something like this:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export GALACTICA_MEM=8000m
export HADOOP_JARS=$(yarn classpath)
Copy additional hadoop libraries
If you deploy Indexima with a Hadoop 3.x version, you need to copy the Tez libraries described in the general requirements to the galactica/tez folder. Without those libraries, Indexima startup will be much slower and the logs will contain error messages about missing Tez libraries.
mkdir -p /opt/indexima/galactica/tez
cd /opt/indexima/galactica/tez
curl -O https://repo1.maven.org/maven2/org/apache/tez/tez-api/0.9.2/tez-api-0.9.2.jar
curl -O https://repo1.maven.org/maven2/org/apache/tez/tez-dag/0.9.2/tez-dag-0.9.2.jar
Configure Indexima core engine
Edit the file galactica.conf, to configure the Indexima core engine. Please note that most Indexima parameters are dynamic, and can be set directly in the Indexima console with a command (see galactica.conf for details about available parameters). We will customize only the static parameters in galactica.conf.
The nodes parameters are used to define how Indexima nodes start and discover other Indexima nodes. Those parameters are valid for all Indexima deployments (Standalone, Kubernetes or Yarn).
nodes.requested=Number of expected nodes. The cluster will start instantly if this number of nodes is attached, without waiting for more nodes to join.
nodes.connect.min-nodes=Minimum number of nodes that must be attached before the cluster starts.
With a standard dynamic setup, the cluster will be launched with command
./start-node.sh --attach xxx --host xxx
With a Yarn dynamic setup, the cluster will be launched with command
The static setup (listing the nodes directly in galactica.conf) has not been supported since version 2022.1.
The warehouse parameters contain the path to the shared filesystem that Indexima nodes use to store data. The configuration options are described in Setup shared storage.
Copy everything to other nodes: To start Indexima, everything we did must be reproduced in exactly the same way on the other node(s) of the cluster. One way to do this is to run an rsync command once everything is set up properly on the master node.
You can read our documentation about Ansible deployment for more advanced cases.
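As an illustration of the rsync approach mentioned above, here is a minimal sketch. It is an assumption-laden example, not an official procedure: the function name sync_node and the host name node2 are hypothetical, SSH access to the target node is assumed to be already configured, and the whole /opt/indexima base directory is copied. The rsync options used are -a (archive mode, which preserves permissions and timestamps) and -z (compression during transfer).

```shell
# Base directory used throughout this page.
BASE_DIR=/opt/indexima

# Copy the base directory to the same path on one target node.
# Set RSYNC="echo rsync" to print the command instead of running it.
sync_node() {
  ${RSYNC:-rsync} -az "$BASE_DIR/" "$1:$BASE_DIR/"
}

# Example (hypothetical host name):
#   RSYNC="echo rsync" sync_node node2   # preview the command
#   sync_node node2                      # perform the copy
```

Previewing the command first is a cheap way to confirm the source and destination paths before anything is transferred.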
You need to specify a few more parameters for a YARN deployment.
yarn.resourcemanager.hostname = myresourcemanager.intra
yarn.memory = 10000
yarn.dir = hdfs://myhdfs.intra:8020/tmp/indexima
yarn.name = Indexima
yarn.resourcemanager.hostname is the hostname of your YARN resource manager. Do not include the port number.
yarn.memory is the RAM allocated to each container. A good practice is to set this value 10-20% higher than the GALACTICA_MEM value set above in the galactica-env.sh file.
yarn.dir is the temporary HDFS folder to which the edge node (where we installed everything) pushes the files the other nodes need to start an Indexima cluster.
yarn.name is the name of the application as it appears in the Resource Manager UI.
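Because yarn.memory and GALACTICA_MEM are set in two different files, a quick sanity check can catch a mismatch. The sketch below is hypothetical (the function name and messages are illustrative only) and checks just one rule from this page: in a Yarn context, GALACTICA_MEM must be no more than 80% of the Yarn assigned memory. Both values are assumed to be in megabytes.

```shell
# Check that GALACTICA_MEM is at most 80% of yarn.memory (both in MB).
# Arguments: GALACTICA_MEM (with or without the trailing "m"), yarn.memory.
check_yarn_headroom() {
  galactica_mb=${1%m}
  yarn_mb=$2
  if [ $(( galactica_mb * 100 )) -le $(( yarn_mb * 80 )) ]; then
    echo "ok"
  else
    echo "GALACTICA_MEM exceeds 80% of yarn.memory"
    return 1
  fi
}
```

With the example values used on this page, check_yarn_headroom 8000m 10000 succeeds, since 8000 is exactly 80% of 10000.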
If you are using a Hadoop cluster, you probably already have a Hive server running. Since Indexima is a Hive engine, it uses the same default port, 10000. You need to change this to avoid a conflict. In the galactica/conf folder, edit the hive-site.xml file and set the property named hive.server2.thrift.port to 10001.
<property>
  <name>hive.server2.thrift.port</name>
  <value>10001</value>
</property>
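After editing hive-site.xml, it can be useful to confirm which port is actually configured. The helper below is a hypothetical sketch using only standard tools (tr and sed). It assumes the value element directly follows the name element inside the property, as in the snippet above, and it is not a general XML parser.

```shell
# Print the value configured for hive.server2.thrift.port in a hive-site.xml file.
# Assumes <value> immediately follows <name> within the same <property>.
get_thrift_port() {
  # Join the file into a single line, then extract the digits in <value>.
  tr -d '\n' < "$1" |
    sed -n 's#.*<name>hive\.server2\.thrift\.port</name>[[:space:]]*<value>\([0-9]*\)</value>.*#\1#p'
}
```

For the layout assumed on this page, get_thrift_port /opt/indexima/galactica/conf/hive-site.xml prints the configured port, or nothing if the property is absent.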
Consistency Checks at Reboot
When the Indexima service starts, it runs a series of checks to verify that everything it needs to run correctly is in place. If the service shuts down shortly after starting, check the logs to find out what is missing from the Indexima deployment.
In particular, the following consistency checks are performed, and prevent the service from starting in case of failure:
Existence of mandatory configuration files (galactica.conf, hive-site.xml, log4j2.xml, notifications.json, optimize_index.json, supported_dbs.json)
Availability of working directories (directories listed in the configuration, for example for the logs, the shared warehouse, or temporary files)
Validity of critical configuration options (for example, the number of cores or the number of reader threads)
Other checks that will generate logs but not prevent the system from starting:
Usage of deprecated properties or unknown properties in configuration files
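The first blocking check listed above, the existence of mandatory configuration files, is easy to reproduce before starting the service. The following sketch is hypothetical and independent of Indexima's own startup checks. It only tests that each file named in that list exists in a given conf directory.

```shell
# Print each mandatory configuration file that is missing from a conf directory.
# The file names come from the list of mandatory files on this page.
missing_conf_files() {
  conf_dir=$1
  for f in galactica.conf hive-site.xml log4j2.xml \
           notifications.json optimize_index.json supported_dbs.json; do
    [ -f "$conf_dir/$f" ] || echo "$f"
  done
}
```

An empty output from missing_conf_files /opt/indexima/galactica/conf means all six files are present. It says nothing about whether their contents are valid.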