LOAD DATA (Files Load)
The LOAD DATA command allows you to write raw data into Indexima tables.
Syntax
Load Data from local disk
LOAD DATA [LOCAL]
INPATH '<protocol>://<absolute_path_to_data>'
INTO TABLE <table_name>
FORMAT CSV|ORC|PARQUET|JSON
SEPARATOR ',|;|\t'
[SKIP <number_of_row_to_skip> ]
[LIMIT <limitnumber>]
[PATTERN <pattern>]
[PREFIX <prefix>]
[SUFFIX <suffix>]
[NOCHECK]
Parameters
Parameter name | Description | Possible values |
---|---|---|
LOCAL | If used, this keyword tells the Master node to read data from his own filesystem. | Empty | LOCAL |
protocol | The type of datalake you want to load data from. | s3 | adl | gs | hdfs |
absolute_path_to_data | The absolute path to a folder containing all the files you want to load data from. If LOCAL is specified, only the Master node needs to have read access to this path. | Any absolute path is readable by the Indexima nodes. |
table_name | The name of the Indexima table you want to load the data into. | The exact name of the Indexima table you want to load data into. The structure of the files must match the structure of the table. |
number_of_row_to_skip | The number of rows to skip before starting the load. This is used to skip the first or the 2 first rows, which are usually header rows. | Any positive number. |
limitnumber | Maximum Number of rows that will be read from each file | Any positive number |
pattern | Only files whose name begins with the defined pattern would be loaded. Must end by a star. | |
prefix | see below" Load partitions from Hive table " | |
Suffix | supports only CSV format, same concept as prefix applied for missing column at the end of each line | |
NoCheck | Without this option, INDEXIMA checks if loaded values are compatible with the datatype set when creating the table. INDEXIMA triggers errors when datatypes do not match. |
Load Example
Example
LOAD DATA
INPATH 'mypath'
INTO TABLE myschema.mytable
FORMAT CSV
SEPARATOR ','
SKIP 1
LIMIT 10
PATTERN 'Mypattern*'
PREFIX 'myprefix'
NOCHECK
Load partitions from Hive table
Provide a path of the root partition if you want to load only a set of partitions and not all partitions from a Hive table and add PREFIX 'column_values' to specify the values for the partition columns.
LOAD DATA INPATH 'hdfs://data_node:8020/apps/hive/warehouse/my_imported_data_orc/year=2018/month=12/day=10'
INTO TABLE my_test
FORMAT orc
PREFIX '2018,12,10,'
If the Hive table is in tab-separated format, the separator must be \t so the statement must be PREFIX '2018\t12\t10\t'
Load Example with a suffix
Add SUFFIX 'column_values' to specify the values that will finish the CSV line when there are fewer column than expected (defined by the list of columns in the table creation)
LOAD DATA INPATH 'hdfs://data_node:8020/apps/hive/warehouse/myORCtable'
INTO TABLE my_test
FORMAT CSV
SUFFIX 'XX,YY'