Indexima Warehouse Storage
This guide explains how to configure and use the Indexima warehouse with the compatible file systems.
Local (not shared)
When using a single Indexima instance, you can and should use the local disk of the machine for better performance.
This can be set in the galactica.conf file:
galactica.conf
warehouse.protocol = LOCAL
warehouse = /path/to/warehouse
warehouse.shared = false
Local (shared) / NFS
When using an Indexima cluster (more than one node), you need to mount a shared disk (NFS) so that each node can read and write into a shared warehouse. To do so, use the following configuration in the galactica.conf file:
galactica.conf
warehouse.protocol = LOCAL
warehouse = /path/to/warehouse
warehouse.shared = true
In this example, all nodes must mount the disk to the same /path/to/warehouse path.
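As an illustration, the shared disk could be mounted like this on each node. This is a sketch, not an Indexima requirement: the server name nfs-server and the export path /export/warehouse are placeholders for your own NFS setup.

```shell
# /etc/fstab entry on every Indexima node (nfs-server and /export/warehouse are placeholders)
# nfs-server:/export/warehouse  /path/to/warehouse  nfs  defaults,_netdev  0  0

# Or mount it manually on each node:
sudo mkdir -p /path/to/warehouse
sudo mount -t nfs nfs-server:/export/warehouse /path/to/warehouse
```

Whichever method you use, make sure the mount point is identical on every node, since all of them must see the warehouse at the same path.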
HDFS
If you are deploying Indexima with YARN, you will probably use HDFS to store the Indexima warehouse. However, you can also use HDFS in a standalone deployment:
galactica.conf
warehouse.protocol = HDFS
warehouse = hdfs://path/to/warehouse
pages.oneFilePerColumn = false
pages = 16000
When using HDFS, it is highly recommended to set pages.oneFilePerColumn to false, as Indexima is very I/O-intensive during the commit phases.
Amazon S3
You can also use Amazon S3 to store the warehouse.
galactica.conf
warehouse.protocol = S3
warehouse = s3a://bucket-name/path/to/warehouse
Note that the prefix is 's3a', not 's3'. It is interpreted on the Indexima side and must not be omitted.
You need to set the correct AWS credentials to make this work. The two recommended ways are environment variables and instance roles.
Environment variables
Add these two lines to your galactica-env.sh file:
galactica-env.sh
export AWS_ACCESS_KEY_ID=<access_key_id>
export AWS_SECRET_ACCESS_KEY=<secret_key>
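Before starting Indexima, you can check that these credentials are valid. This sketch assumes the AWS CLI is installed (it is not an Indexima requirement) and reuses the example bucket name from above:

```shell
# Load the variables, then confirm AWS accepts them
source galactica-env.sh
aws sts get-caller-identity                       # shows which account/identity the keys belong to
aws s3 ls s3://bucket-name/path/to/warehouse/     # checks read access to the warehouse path
```

If either command fails, fix the credentials before troubleshooting the Indexima configuration itself.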
Instance Role
If your Indexima instances are AWS EC2 instances, you can attach a role with the proper permissions to them, and the S3 connection will work without further configuration.
GCP Cloud Storage
You can use GCP Cloud Storage in the same way as Amazon S3.
galactica.conf
warehouse.protocol = GS
warehouse = gs://bucket-name/path/to/warehouse
You need to set Google Credentials to make this work. The only supported way is to use a credentials.json file generated with the GCP IAM service.
For more information about creating credentials for Google Storage access, refer to the official documentation:
- https://cloud.google.com/iam/docs/service-accounts
- https://cloud.google.com/storage/docs/access-control/using-iam-permissions
In order to use these credentials, the file must be present on every Indexima instance. Then, add the following line to your galactica-env.sh file:
galactica-env.sh
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
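If you still need to generate the key file, it can be created with the gcloud CLI. This is a sketch: the service-account name indexima-sa and project my-project are placeholders for your own IAM setup.

```shell
# Create a JSON key for an existing service account (indexima-sa and my-project are placeholders)
gcloud iam service-accounts keys create /path/to/credentials.json \
  --iam-account=indexima-sa@my-project.iam.gserviceaccount.com

# Copy the file to every Indexima instance, then restrict its permissions
chmod 600 /path/to/credentials.json
```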
Azure DataLake Gen1
If you are deploying Indexima in Azure, you might want to use Azure Data Lake for storage. Indexima currently supports only ADLS Gen1.
Azure Data Lake behaves like HDFS. First, you need to set your warehouse in galactica.conf.
galactica.conf
warehouse.protocol = ADL
warehouse = adl://bucket-name/path/to/warehouse
You then need to set up the correct access. In order to do that, you need to customize the core-site.xml file located in the Hadoop distribution you installed during the prerequisites. You also must add some Java libraries to the common Hadoop libs.
Customize core-site.xml
If you installed it in the same place as recommended in the documentation, the file should be located at /opt/hadoop-2.8.3/etc/hadoop/core-site.xml
core-site.xml
<configuration>
  <property>
    <name>fs.adl.oauth2.client.id</name>
    <value><YOUR_AZURE_APPLICATION_CLIENT_ID></value>
  </property>
  <property>
    <name>fs.adl.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>fs.adl.oauth2.refresh.url</name>
    <value>https://login.microsoftonline.com/<YOUR_AZURE_APPLICATION_TENANT_ID>/oauth2/token</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.credential</name>
    <value><YOUR_AZURE_APPLICATION_SECRET></value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>adl://<YOUR_AZURE_DATALAKE_NAME>.azuredatalakestore.net</value>
  </property>
  <property>
    <name>fs.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.Adl</value>
  </property>
</configuration>
Add the Azure Java libraries
You need to add the following jar files to the share/hadoop/common folder of your Hadoop installation:
- https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.21/slf4j-api-1.7.21.jar
- https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar
- https://repo1.maven.org/maven2/com/squareup/okio/okio/1.4.0/okio-1.4.0.jar
- https://repo1.maven.org/maven2/com/squareup/okhttp/okhttp/2.4.0/okhttp-2.4.0.jar
- https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure-datalake/2.8.5/hadoop-azure-datalake-2.8.5.jar
- https://repo1.maven.org/maven2/com/microsoft/azure/azure-data-lake-store-sdk/2.3.7/azure-data-lake-store-sdk-2.3.7.jar
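The downloads above can be scripted. This is a possible sketch, assuming wget is available and Hadoop is installed under /opt/hadoop-2.8.3 as in the rest of this guide; run it (or copy the resulting jars) on every Indexima node.

```shell
#!/bin/sh
# Download the ADLS Gen1 dependencies into Hadoop's common lib folder
HADOOP_COMMON=/opt/hadoop-2.8.3/share/hadoop/common

for url in \
  https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.21/slf4j-api-1.7.21.jar \
  https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar \
  https://repo1.maven.org/maven2/com/squareup/okio/okio/1.4.0/okio-1.4.0.jar \
  https://repo1.maven.org/maven2/com/squareup/okhttp/okhttp/2.4.0/okhttp-2.4.0.jar \
  https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure-datalake/2.8.5/hadoop-azure-datalake-2.8.5.jar \
  https://repo1.maven.org/maven2/com/microsoft/azure/azure-data-lake-store-sdk/2.3.7/azure-data-lake-store-sdk-2.3.7.jar
do
  wget -q -P "$HADOOP_COMMON" "$url"
done
```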