Indexima Warehouse Storage
This guide explains how to configure and use the Indexima warehouse with the compatible file systems.
Local (not shared)
When using a single Indexima instance, you can and should use the local disk of the machine for better performance.
This can be set in the galactica.conf file:
galactica.conf
warehouse.protocol = LOCAL
warehouse = /path/to/warehouse
warehouse.shared = false
Local (shared) / NFS
When using an Indexima cluster (more than one node), you need to mount a shared disk (NFS) so that each node can read and write into a shared warehouse. To do so, use the following configuration in the galactica.conf file:
galactica.conf
warehouse.protocol = LOCAL
warehouse = /path/to/warehouse
warehouse.shared = true
In this example, all nodes must mount the disk to the same /path/to/warehouse path.
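As an illustration, the shared disk could be mounted like this on each node. This is a sketch, not an Indexima requirement: the server name nfs-server and the export path /export/warehouse are placeholders for your own NFS setup.

```shell
# /etc/fstab entry on every Indexima node (nfs-server and /export/warehouse are placeholders)
# nfs-server:/export/warehouse  /path/to/warehouse  nfs  defaults,_netdev  0  0

# Or mount it manually on each node:
sudo mkdir -p /path/to/warehouse
sudo mount -t nfs nfs-server:/export/warehouse /path/to/warehouse
```

Whichever method you use, make sure the mount point is identical on every node, since all of them must see the warehouse at the same path.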
HDFS
If you are deploying Indexima with YARN, you will probably use HDFS to store the Indexima warehouse. However, you can also use HDFS in a standalone deployment:
galactica.conf
warehouse.protocol = HDFS
warehouse = hdfs://path/to/warehouse
pages.oneFilePerColumn = false
pages = 16000
When using HDFS, it is highly recommended to set pages.oneFilePerColumn to false, as Indexima is very I/O-intensive during the commit phases.
Amazon S3
You can also use Amazon S3 to store the warehouse.
galactica.conf
warehouse.protocol = S3
warehouse = s3a://bucket-name/path/to/warehouse
Note that the prefix is 's3a', not 's3'. It is interpreted on the Indexima side and must not be omitted.
You need to set the correct AWS credentials to make this work. The two recommended ways are environment variables and instance roles.
Environment variables
Add these two lines to your galactica-env.sh file:
galactica-env.sh
export AWS_ACCESS_KEY_ID=<access_key_id>
export AWS_SECRET_ACCESS_KEY=<secret_key>
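Before starting Indexima, you can check that these credentials are valid. This sketch assumes the AWS CLI is installed (it is not an Indexima requirement) and reuses the example bucket name from above:

```shell
# Load the variables, then confirm AWS accepts them
source galactica-env.sh
aws sts get-caller-identity                       # shows which account/identity the keys belong to
aws s3 ls s3://bucket-name/path/to/warehouse/     # checks read access to the warehouse path
```

If either command fails, fix the credentials before troubleshooting the Indexima configuration itself.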
Instance Role
If your Indexima instances are AWS EC2 instances, you can attach a role with the proper permissions to them, and the S3 connection will work without further configuration.
GCP Cloud Storage
You can use GCP Cloud Storage in the same way as Amazon S3.
galactica.conf
warehouse.protocol = GS
warehouse = gs://bucket-name/path/to/warehouse
You need to set Google Credentials to make this work. The only supported way is to use a credentials.json file generated with the GCP IAM service.
For more information about creating credentials for Google Storage access, refer to the official documentation:
- https://cloud.google.com/iam/docs/service-accounts
- https://cloud.google.com/storage/docs/access-control/using-iam-permissions
In order to use these credentials, the file must be present on every Indexima instance. Then, add the following line to your galactica-env.sh file:
galactica-env.sh
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
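If you still need to generate the key file, it can be created with the gcloud CLI. This is a sketch: the service-account name indexima-sa and project my-project are placeholders for your own IAM setup.

```shell
# Create a JSON key for an existing service account (indexima-sa and my-project are placeholders)
gcloud iam service-accounts keys create /path/to/credentials.json \
  --iam-account=indexima-sa@my-project.iam.gserviceaccount.com

# Copy the file to every Indexima instance, then restrict its permissions
chmod 600 /path/to/credentials.json
```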
Azure DataLake Gen1
If you are deploying Indexima in Azure, you might want to use Azure Data Lake for storage. Indexima currently supports only ADLS Gen1.
Azure Data Lake behaves like HDFS. First, you need to set your warehouse in galactica.conf.
galactica.conf
warehouse.protocol = ADL
warehouse = adl://bucket-name/path/to/warehouse
You then need to set up the correct access. In order to do that, you need to customize the core-site.xml file located in the Hadoop distribution you installed during the prerequisites. You also must add some Java libraries to the common Hadoop libs.
Customize core-site.xml
If you installed it in the same place as recommended in the documentation, the file should be located at /opt/hadoop-2.8.3/etc/hadoop/core-site.xml
core-site.xml
<configuration>
  <property>
    <name>fs.adl.oauth2.client.id</name>
    <value><YOUR_AZURE_APPLICATION_CLIENT_ID></value>
  </property>
  <property>
    <name>fs.adl.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>fs.adl.oauth2.refresh.url</name>
    <value>https://login.microsoftonline.com/<YOUR_AZURE_APPLICATION_TENANT_ID>/oauth2/token</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.credential</name>
    <value><YOUR_AZURE_APPLICATION_SECRET></value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>adl://<YOUR_AZURE_DATALAKE_NAME>.azuredatalakestore.net</value>
  </property>
  <property>
    <name>fs.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.Adl</value>
  </property>
</configuration>
Add the Azure Java libraries
You need to add the following jar files to the share/hadoop/common folder of your Hadoop installation:
- https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.21/slf4j-api-1.7.21.jar
- https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar
- https://repo1.maven.org/maven2/com/squareup/okio/okio/1.4.0/okio-1.4.0.jar
- https://repo1.maven.org/maven2/com/squareup/okhttp/okhttp/2.4.0/okhttp-2.4.0.jar
- https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure-datalake/2.8.5/hadoop-azure-datalake-2.8.5.jar
- https://repo1.maven.org/maven2/com/microsoft/azure/azure-data-lake-store-sdk/2.3.7/azure-data-lake-store-sdk-2.3.7.jar
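The downloads above can be scripted. This is a possible sketch, assuming wget is available and Hadoop is installed under /opt/hadoop-2.8.3 as in the rest of this guide; run it (or copy the resulting jars) on every Indexima node.

```shell
#!/bin/sh
# Download the ADLS Gen1 dependencies into Hadoop's common lib folder
HADOOP_COMMON=/opt/hadoop-2.8.3/share/hadoop/common

for url in \
  https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.21/slf4j-api-1.7.21.jar \
  https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar \
  https://repo1.maven.org/maven2/com/squareup/okio/okio/1.4.0/okio-1.4.0.jar \
  https://repo1.maven.org/maven2/com/squareup/okhttp/okhttp/2.4.0/okhttp-2.4.0.jar \
  https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure-datalake/2.8.5/hadoop-azure-datalake-2.8.5.jar \
  https://repo1.maven.org/maven2/com/microsoft/azure/azure-data-lake-store-sdk/2.3.7/azure-data-lake-store-sdk-2.3.7.jar
do
  wget -q -P "$HADOOP_COMMON" "$url"
done
```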