Indexima always requires shared storage, readable and writable by all Indexima nodes of a cluster. This storage is used to store the warehouse (hyperindex dumps and ingested data) and the log history shared between nodes.

This guide will show you how to configure this shared storage with various compatible file systems.
The three parameters in galactica.conf that must point to shared storage are:

  • warehouse
  • history.dir
  • history.export

Any storage can be used with any deployment type (YARN, standalone, Kubernetes).

The LOAD DATA command of Indexima can load files from any storage, independently of the storage used by Indexima for its warehouse. The authentication configurations described here are therefore also valid when you need to load data from a storage distinct from the warehouse storage.
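
For example, a warehouse stored on HDFS can still ingest files from S3 once the S3 credentials below are in place. A minimal sketch, assuming Indexima's Hive-compatible JDBC endpoint and a Hive-style LOAD DATA syntax (host, port, bucket, and table names are illustrative):

beeline -u "jdbc:hive2://indexima-node:10000" \
  -e "LOAD DATA INPATH 's3a://another-bucket/incoming/sales.csv' INTO TABLE sales;"
BASH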

HDFS

Indexima can use HDFS storage. Customize your galactica.conf as in the following sample:

galactica.conf

warehouse = hdfs://hdfs_server/indexima/warehouse
history.dir = hdfs://hdfs_server/indexima/log/history
history.export = hdfs://hdfs_server/indexima/log/history-export
pages.oneFilePerColumn = false
pages = 16000
BASH

When using HDFS, it is highly recommended to set pages.oneFilePerColumn to false, as Indexima is very I/O-intensive during the commit phases.
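
Whether or not Indexima creates them automatically, creating the target directories up front with the Hadoop CLI is a quick way to validate HDFS access from a node (paths as in the sample above):

hadoop fs -mkdir -p hdfs://hdfs_server/indexima/warehouse \
  hdfs://hdfs_server/indexima/log/history \
  hdfs://hdfs_server/indexima/log/history-export
BASH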

Amazon S3

Indexima can use Amazon S3 storage. Customize your galactica.conf as in the following sample:

galactica.conf

warehouse = s3a://bucket-name/indexima/warehouse
history.dir = s3a://bucket-name/indexima/log/history
history.export = s3a://bucket-name/indexima/log/history-export
BASH

Note that the URI scheme is 's3a', as defined by the hadoop-aws module.

In order to authorize access to the S3 storage, please follow the hadoop-aws documentation. For a standard setup, you just need to add the AWS environment variables in your galactica-env.sh:

galactica-env.sh

export AWS_ACCESS_KEY_ID=<access_key_id>
export AWS_SECRET_ACCESS_KEY=<secret_key>
BASH
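
Once the variables are set, you can sanity-check access from any node with the Hadoop CLI before starting Indexima (bucket and paths as in the sample above); the same check should also work for the other storages in this guide, once their connectors are configured, by adjusting the URI:

hadoop fs -ls s3a://bucket-name/indexima/warehouse
BASH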

Azure ADLS Gen2

Indexima can use Azure ADLS Gen2 Blob storage. Customize your galactica.conf as in the following sample:

galactica.conf

warehouse = abfs://YOUR_CONTAINER@YOUR_STORAGE_ACCOUNT.dfs.core.windows.net/warehouse
history.dir = abfs://YOUR_CONTAINER@YOUR_STORAGE_ACCOUNT.dfs.core.windows.net/log/history
history.export = abfs://YOUR_CONTAINER@YOUR_STORAGE_ACCOUNT.dfs.core.windows.net/log/history-export
BASH

Add the following variable to galactica-env.sh:

galactica-env.sh

export HADOOP_OPTIONAL_TOOLS=hadoop-azure 
BASH

Hadoop >=3.3.1 is required for native support of AzureBlobFileSystem.
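
To check which Hadoop version is bundled on a node:

hadoop version
BASH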

In order to provide access to your Azure storage, add a core-site.xml file to the Indexima galactica/conf folder. See the Hadoop documentation for Azure Blob storage for the various authentication setups.

Sample configuration for core-site.xml:

core-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.abfs.impl</name>
    <value>org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem</value>
  </property>
  <property>
    <name>fs.azure.account.auth.type.YOUR_STORAGE_ACCOUNT.dfs.core.windows.net</name>
    <value>SharedKey</value>
  </property>
  <property>
    <name>fs.azure.account.key.YOUR_STORAGE_ACCOUNT.dfs.core.windows.net</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
</configuration>
XML

Azure ADLS Gen1

Indexima can use Azure ADLS Gen1 storage. Customize your galactica.conf as in the following sample:

galactica.conf

warehouse = adl://YOUR_SERVICE_NAME.azuredatalakestore.net/indexima/warehouse
history.dir = adl://YOUR_SERVICE_NAME.azuredatalakestore.net/indexima/log/history
history.export = adl://YOUR_SERVICE_NAME.azuredatalakestore.net/indexima/log/history-export
BASH

Add the following variable to galactica-env.sh:

galactica-env.sh

export HADOOP_OPTIONAL_TOOLS=hadoop-azure-datalake
BASH

Hadoop >=3.1.4 is advised for ADLS integration, as Hadoop 2 doesn't embed the hadoop-azure-datalake module.
For Hadoop 2, please manually add hadoop-azure-datalake.jar and its dependencies to the share/hadoop/common folder, as sketched below.
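
As an illustration (jar versions are placeholders), the module and its SDK dependency can be copied to the common classpath on every node:

# Hadoop 2 only: add the ADLS Gen1 connector and its SDK to the common classpath.
# Versions X.Y.Z are placeholders; match them to your Hadoop distribution.
cp hadoop-azure-datalake-X.Y.Z.jar azure-data-lake-store-sdk-X.Y.Z.jar \
   $HADOOP_HOME/share/hadoop/common/
BASH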

In order to provide access to your Azure storage, add a core-site.xml file to the Indexima galactica/conf folder. See the Hadoop documentation for ADLS Gen1 for the various authentication setups.

Sample configuration for core-site.xml:

core-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.adl.oauth2.client.id</name>
    <value>YOUR_AZURE_APPLICATION_CLIENT_ID</value>
  </property>
  <property>
    <name>fs.adl.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>fs.adl.oauth2.refresh.url</name>
    <value>https://login.microsoftonline.com/YOUR_AZURE_APPLICATION_TENANT_ID/oauth2/token</value>
  </property>
  <property>
    <name>fs.adl.oauth2.credential</name>
    <value>YOUR_AZURE_APPLICATION_SECRET</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>adl://YOUR_AZURE_DATALAKE_NAME.azuredatalakestore.net</value>
  </property>
  <property>
    <name>fs.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.Adl</value>
  </property>
</configuration>
XML

GCP Cloud Storage

Indexima can use GCP Cloud Storage. Customize your galactica.conf as in the following sample:

galactica.conf

warehouse = gs://bucket-name/path/to/warehouse
history.dir = gs://bucket-name/path/to/log/history
history.export = gs://bucket-name/path/to/log/history-export
BASH

You need to set Google credentials to make this work. The only supported way is to use a credentials.json key file generated with the GCP IAM service.
For more information about creating credentials for Google Storage access, refer to the official GCP documentation on service accounts and IAM permissions.
In order to use these credentials, the file must be present on every Indexima node. Then, add the following line to your galactica-env.sh file:

galactica-env.sh

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
BASH
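
For illustration, such a key file can be generated for an existing service account with the gcloud CLI (the service account and project names are placeholders):

# Generate a JSON key for an existing service account, then copy
# the resulting credentials.json to every Indexima node.
gcloud iam service-accounts keys create credentials.json \
  --iam-account=indexima-sa@your-project.iam.gserviceaccount.com
BASH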

CEPH

Indexima can use CEPH storage in various setups. The standard setup is to access CEPH through a RADOS gateway with the Hadoop S3A client.
Please refer to the detailed CEPH documentation. The property fs.s3a.path.style.access is required in core-site.xml.

In conf/galactica.conf, set the warehouse.s3.* parameters and omit the warehouse parameter.

Sample configuration for core-site.xml:

core-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>https://YOUR_CEPH_ENDPOINT</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>true</value>
  </property>
</configuration>
XML

Local / NFS

Indexima can use a network file system mounted locally as shared storage. You need to mount the shared disk (NFS) so that each node can read and write into the warehouse (a sample mount command is shown below). To do so, use the following configuration in the galactica.conf file.

galactica.conf

warehouse = /path/to/warehouse
BASH

In this example, all nodes must mount the shared disk at the same /path/to/warehouse path.
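
As an illustration (the server name and export path are hypothetical), each node could mount the share like this:

# Mount the NFS export at the same local path on every node.
sudo mount -t nfs nfs-server:/export/indexima /path/to/warehouse
# Or make the mount permanent via /etc/fstab:
# nfs-server:/export/indexima  /path/to/warehouse  nfs  defaults  0  0
BASH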