LOAD DATA (Files Load)

The LOAD DATA command allows you to write raw data into Indexima tables.

Syntax

Load Data from local disk

SQL

LOAD DATA [LOCAL] 
INPATH '<protocol>://<absolute_path_to_data>'
INTO TABLE <table_name> 
FORMAT CSV|ORC|PARQUET|JSON
SEPARATOR ',|;|\t' 
[SKIP <number_of_row_to_skip> ]
[LIMIT <limitnumber>]
[PATTERN <pattern>]
[PREFIX <prefix>]
[SUFFIX <suffix>]
[NOCHECK]

Parameters

Parameter name	Description	Possible values
LOCAL	If used, this keyword tells the Master node to read data from his own filesystem. When not specified, every node must be able to read the data located in <absolute_path_to_data> When specified, <protocol> is not mandatory	Empty \| LOCAL
protocol	The type of datalake you want to load data from.	s3 \| adl \| gs \| hdfs
absolute_path_to_data	The absolute path to a folder containing all the files you want to load data from. If LOCAL is specified, only the Master node needs to have read access to this path.	Any absolute path is readable by the Indexima nodes.
table_name	The name of the Indexima table you want to load the data into.	The exact name of the Indexima table you want to load data into. The structure of the files must match the structure of the table.
number_of_row_to_skip	The number of rows to skip before starting the load. This is used to skip the first or the 2 first rows, which are usually header rows.	Any positive number.
limitnumber	Maximum Number of rows that will be read from each file	Any positive number
pattern	Only files whose name begins with the defined pattern would be loaded. Must end by a star.
prefix	see below" Load partitions from Hive table "
Suffix	supports only CSV format, same concept as prefix applied for missing column at the end of each line
NoCheck	Without this option, INDEXIMA checks if loaded values are compatible with the datatype set when creating the table. INDEXIMA triggers errors when datatypes do not match. Specifying `NOCHECK` skips this test and improve dramatically the performance of the loading. This is a good practice to run once `LOAD DATA` with data check enable. Then `NOCHECK` can be set for recurrent data loading to speed up the load process

Load Example

Example

SQL

LOAD DATA 
INPATH 'mypath'
INTO TABLE myschema.mytable
FORMAT CSV
SEPARATOR ',' 
SKIP 1
LIMIT 10
PATTERN 'Mypattern*'
PREFIX 'myprefix'
NOCHECK

Load partitions from Hive table

Provide a path of the root partition if you want to load only a set of partitions and not all partitions from a Hive table and add PREFIX 'column_values' to specify the values for the partition columns.

SQL

LOAD DATA INPATH 'hdfs://data_node:8020/apps/hive/warehouse/my_imported_data_orc/year=2018/month=12/day=10'
INTO TABLE my_test 
FORMAT orc 
PREFIX '2018,12,10,'

If the Hive table is in tab-separated format, the separator must be \t so the statement must be PREFIX '2018\t12\t10\t'

Load Example with a suffix

Add SUFFIX 'column_values' to specify the values that will finish the CSV line when there are fewer column than expected (defined by the list of columns in the table creation)

SQL

LOAD DATA INPATH 'hdfs://data_node:8020/apps/hive/warehouse/myORCtable'
INTO TABLE my_test 
FORMAT CSV
SUFFIX 'XX,YY'