Set up shared storage

Indexima always requires shared storage that is readable and writable by all Indexima nodes of a cluster. This storage holds the warehouse (hyperindex dumps and ingested data) and the log history shared between nodes.

This guide shows how to configure this shared storage with the various compatible file systems.
The three parameters in galactica.conf that must target shared storage are:

  • warehouse
  • history.dir
  • history.export

Any storage can be used with any deployment type (YARN, standalone, Kubernetes).

Indexima's LOAD DATA command can load files from any storage, independently of the storage used by Indexima for its warehouse. The authentication configurations described here are therefore also valid when you need to load data from a storage distinct from the warehouse storage, as in the sketch below.
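
For example, a cluster whose warehouse lives on HDFS can still ingest files straight from S3. A minimal sketch, assuming Indexima's Hive-style LOAD DATA syntax and its Hive-compatible JDBC endpoint; the host, port, table name, and bucket are hypothetical placeholders:

BASH
# Load a file from S3 into an Indexima table while the warehouse itself is on HDFS.
# indexima-node, 10000, sales, and my-ingest-bucket are placeholders for your cluster.
beeline -u "jdbc:hive2://indexima-node:10000" \
  -e "LOAD DATA INPATH 's3a://my-ingest-bucket/exports/sales.csv' INTO TABLE sales;"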

HDFS

Indexima can use HDFS storage. Customize your galactica.conf as in the following sample:

galactica.conf

BASH
warehouse = hdfs://hdfs_server/indexima/warehouse
history.dir = hdfs://hdfs_server/indexima/log/history
history.export = hdfs://hdfs_server/indexima/log/history-export
pages.oneFilePerColumn = false
pages = 16000

When using HDFS, it is highly recommended to set pages.oneFilePerColumn to false, because Indexima performs very intensive I/O during commit phases.
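
Before the first start, the warehouse and history directories must exist on HDFS and be writable by the nodes. A minimal sketch, assuming a hypothetical indexima service user:

BASH
# Create the directories referenced in the sample above.
hdfs dfs -mkdir -p hdfs://hdfs_server/indexima/warehouse \
                   hdfs://hdfs_server/indexima/log/history \
                   hdfs://hdfs_server/indexima/log/history-export
# Grant ownership to the (hypothetical) user running the Indexima nodes.
hdfs dfs -chown -R indexima:indexima hdfs://hdfs_server/indexima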

Amazon S3

Indexima can use Amazon S3 storage. Customize your galactica.conf as in the following sample:

galactica.conf

BASH
warehouse = s3a://bucket-name/indexima/warehouse
history.dir = s3a://bucket-name/indexima/log/history
history.export = s3a://bucket-name/indexima/log/history-export

Note that the URI scheme is 's3a', as defined by the hadoop-aws module.

To authorize access to the S3 storage, follow the hadoop-aws documentation. For a standard setup, you just need to add the AWS environment variables in your galactica-env.sh:

galactica-env.sh

BASH
export AWS_ACCESS_KEY_ID=<access_key_id>
export AWS_SECRET_ACCESS_KEY=<secret_key>
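
To check connectivity before starting the cluster, one option is to list the warehouse path with the Hadoop CLI, assuming a Hadoop installation with the hadoop-aws module on its classpath:

BASH
# Export the same credentials, then verify that the bucket is reachable.
export AWS_ACCESS_KEY_ID=<access_key_id>
export AWS_SECRET_ACCESS_KEY=<secret_key>
hadoop fs -ls s3a://bucket-name/indexima/warehouse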

Azure ADLS Gen2

Indexima can use Azure ADLS Gen2 Blob storage. Customize your galactica.conf as in the following sample:

galactica.conf

BASH
warehouse = abfs://YOUR_CONTAINER@YOUR_STORAGE_ACCOUNT.dfs.core.windows.net/warehouse
history.dir = abfs://YOUR_CONTAINER@YOUR_STORAGE_ACCOUNT.dfs.core.windows.net/log/history
history.export = abfs://YOUR_CONTAINER@YOUR_STORAGE_ACCOUNT.dfs.core.windows.net/log/history-export

Add the following variable to galactica-env.sh:

galactica-env.sh

BASH
export HADOOP_OPTIONAL_TOOLS=hadoop-azure 

Hadoop >=3.3.1 is required for native support of AzureBlobFileSystem.

To provide access to your Azure storage, add a core-site.xml file to the Indexima galactica/conf folder. See the Hadoop documentation for Azure Blob storage for the various authentication setups.

Sample configuration for core-site.xml:

core-site.xml

XML
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.abfs.impl</name>
    <value>org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem</value>
  </property>
  <property>
    <name>fs.azure.account.auth.type.YOUR_STORAGE_ACCOUNT.dfs.core.windows.net</name>
    <value>SharedKey</value>
  </property>
  <property>
    <name>fs.azure.account.key.YOUR_STORAGE_ACCOUNT.dfs.core.windows.net</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
</configuration>
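
With core-site.xml in place, a quick sanity check is to list the container with the Hadoop CLI, assuming hadoop-azure is enabled through HADOOP_OPTIONAL_TOOLS as above:

BASH
# Verify that the SharedKey authentication works against the container.
hadoop fs -ls abfs://YOUR_CONTAINER@YOUR_STORAGE_ACCOUNT.dfs.core.windows.net/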

Azure ADLS Gen1

Indexima can use Azure ADLS Gen1 storage. Customize your galactica.conf as in the following sample:

galactica.conf

BASH
warehouse = adl://YOUR_SERVICE_NAME.azuredatalakestore.net/indexima/warehouse
history.dir = adl://YOUR_SERVICE_NAME.azuredatalakestore.net/indexima/log/history
history.export = adl://YOUR_SERVICE_NAME.azuredatalakestore.net/indexima/log/history-export

Add the following variable to galactica-env.sh:

galactica-env.sh

BASH
export HADOOP_OPTIONAL_TOOLS=hadoop-azure-datalake

Hadoop >=3.1.4 is advised for ADLS integration, as Hadoop 2 does not embed the hadoop-azure-datalake module (which provides the adl:// filesystem).
For Hadoop 2, manually add hadoop-azure-datalake.jar and its dependencies to the share/hadoop/common folder.

To provide access to your Azure storage, add a core-site.xml file to the Indexima galactica/conf folder. See the Hadoop documentation for ADLS Gen1 for the various authentication setups.

Sample configuration for core-site.xml:

core-site.xml

XML
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.adl.oauth2.client.id</name>
    <value><YOUR_AZURE_APPLICATION_CLIENT_ID></value>
  </property>
  <property>
    <name>fs.adl.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>fs.adl.oauth2.refresh.url</name>
    <value>https://login.microsoftonline.com/<YOUR_AZURE_APPLICATION_TENANT_ID>/oauth2/token</value>
  </property>
  <property>
    <name>fs.adl.oauth2.credential</name>
    <value><YOUR_AZURE_APPLICATION_SECRET></value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>adl://<YOUR_AZURE_DATALAKE_NAME>.azuredatalakestore.net</value>
  </property>
  <property>
    <name>fs.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.Adl</value>
  </property>
</configuration>
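
As for Gen2, a quick sanity check is to list the data lake root with the Hadoop CLI once core-site.xml is in place:

BASH
# Verify that the OAuth2 client credentials work against the data lake.
hadoop fs -ls adl://YOUR_SERVICE_NAME.azuredatalakestore.net/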

GCP Cloud Storage

Indexima can use GCP Cloud Storage. Customize your galactica.conf as in the following sample:

galactica.conf

BASH
warehouse = gs://bucket-name/path/to/warehouse
history.dir = gs://bucket-name/path/to/log/history
history.export = gs://bucket-name/path/to/log/history-export

You need to set Google credentials to make this work. The only supported way is to use a credentials.json file generated with the GCP IAM service.
For more information about creating credentials for Google Cloud Storage access, refer to the official documentation for service-accounts and using-iam-permissions.
To use these credentials, the file must be present on every Indexima node. Then, add the following line to your galactica-env.sh file:

galactica-env.sh

BASH
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
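
To confirm the service account can actually reach the bucket, one option is the Google Cloud CLI, assuming it is installed on the node:

BASH
# Authenticate with the same service-account key, then list the warehouse path.
gcloud auth activate-service-account --key-file=/path/to/credentials.json
gsutil ls gs://bucket-name/path/to/warehouse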

CEPH

Indexima can use CEPH storage with various setups. The standard setup is to access CEPH through a RADOS gateway with the Hadoop S3A client.
Please refer to the detailed CEPH documentation.

The conf/galactica.conf configuration needs to contain the warehouse.s3.* and s3.* parameters.

To load ORC or Parquet files (using the standard Hadoop libraries), you also need to add a core-site.xml parameter file with fs.s3a.* parameters.
Sample configuration for core-site.xml:

core-site.xml

XML
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>https://YOUR_CEPH_ENDPOINT</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>true</value>
  </property>
</configuration>
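
With these fs.s3a.* settings on the classpath, a sanity check is to list the bucket through the RADOS gateway with the Hadoop CLI; the bucket name is a hypothetical placeholder:

BASH
# List the bucket through the RADOS gateway to verify the S3A configuration.
hadoop fs -ls s3a://bucket-name/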

Local / NFS

Indexima can use a network file system mounted locally as shared storage. You need to mount the shared disk (NFS) so that each node can read and write the warehouse. To do so, use the following configuration in the galactica.conf file.

galactica.conf

BASH
warehouse = /path/to/warehouse

In this example, all nodes must mount the disk to the same /path/to/warehouse path.
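
A minimal sketch of the mount on each node, assuming a hypothetical NFS server name and export path:

BASH
# Mount the shared NFS export at the same path on every node.
sudo mkdir -p /path/to/warehouse
sudo mount -t nfs nfs-server:/export/indexima /path/to/warehouse
# To persist the mount across reboots, add it to /etc/fstab:
# nfs-server:/export/indexima  /path/to/warehouse  nfs  defaults  0  0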
