Monitor an Indexima Cluster

To deliver maximum performance to users, it is recommended to monitor the Indexima cluster. Monitoring makes it possible to anticipate problems before users experience them.

It is recommended to perform these monitoring actions with a dedicated user, which makes the logs easier to read.

Live Monitoring

Overall Health status

Check Node Status every X minutes

Indexima provides a Cluster API that allows checking the health of each node. Each node should return at least the following results:

"status": RUNNING
"attached": TRUE

Rule

If at least one node does not report a running and attached status, the cluster is not operating nominally.

Thanks to the High-Availability feature, Indexima will adapt. However, if this situation was not planned, its cause should be analyzed.
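The node status rule can be sketched in Python as follows. This is a minimal sketch, assuming the Cluster API response has already been fetched and parsed into a list of dictionaries, one per node, each carrying the "status" and "attached" fields shown above. The "name" field and the function name unhealthy_nodes are hypothetical and only serve the illustration.

```python
from typing import Any


def unhealthy_nodes(nodes: list[dict[str, Any]]) -> list[str]:
    """Return the names of nodes that are not both RUNNING and attached."""
    bad = []
    for i, node in enumerate(nodes):
        running = str(node.get("status", "")).upper() == "RUNNING"
        # Assumption: "attached" may arrive as a boolean or as the string "TRUE".
        attached = str(node.get("attached", "")).upper() == "TRUE"
        if not (running and attached):
            # "name" is a hypothetical field; fall back to the node position.
            bad.append(str(node.get("name", f"node-{i}")))
    return bad
```

An empty result means every node is running and attached; any other result should trigger the analysis described above.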

Send Show Index & Show Dictionaries every X minutes

Use the Indexes & Dictionaries API to execute SHOW MEMORY & SHOW DICTIONARIES.

Rule

If these recurring operations take longer than their average duration plus a margin, it is recommended to restart the cluster.
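One way to apply this rule is to time each recurring operation and compare it with the average of previous runs. The sketch below is an illustration only: the helper names timed and is_too_slow, and the 50% default margin, are assumptions and not Indexima settings. The operation itself (for example, the call that executes SHOW DICTIONARIES) is passed in as a plain callable.

```python
import time
from statistics import mean
from typing import Callable


def timed(operation: Callable[[], object]) -> float:
    """Run the operation once and return its duration in seconds."""
    start = time.perf_counter()
    operation()
    return time.perf_counter() - start


def is_too_slow(duration_s: float, history_s: list[float], margin_ratio: float = 0.5) -> bool:
    """True when the duration exceeds the historical average plus a relative margin."""
    if not history_s:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history_s)
    return duration_s > baseline * (1.0 + margin_ratio)
```

A relative margin is used here so that the same check works for fast and slow operations; an absolute margin in seconds would work just as well.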

Check Memory low events during the past X minutes

Use the Events API on each node to collect the Memory low events.

Memory low events are normal: when there is not enough memory to answer queries or to load data, the cluster unloads indexes to free space. In rare cases, even after the system has been freeing space for some time, there is still not enough memory and the system can no longer answer.

Rule

If Memory low events persist for more than 15 minutes, it is recommended to restart the cluster.
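The 15-minute rule needs a definition of "persist". The sketch below assumes the Memory low event timestamps have already been collected from every node, and treats events separated by no more than a chosen gap as one continuous episode. The function name and the 5-minute gap are assumptions made for this illustration.

```python
from datetime import datetime, timedelta


def memory_low_persists(
    event_times: list[datetime],
    min_duration: timedelta = timedelta(minutes=15),
    max_gap: timedelta = timedelta(minutes=5),
) -> bool:
    """True if a run of closely spaced events spans more than min_duration."""
    times = sorted(event_times)
    if not times:
        return False
    run_start = times[0]
    previous = times[0]
    for current in times[1:]:
        if current - previous > max_gap:
            run_start = current  # gap too large: a new episode starts here
        if current - run_start > min_duration:
            return True
        previous = current
    return False
```

Isolated events, or short bursts separated by quiet periods, do not trigger the rule with this definition.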

Data Analysts usage

Queries performance

Use the Queries API to check the performance of the past SELECT queries.

Rule

If the average response time over a given period reaches a defined threshold, users are experiencing slowness.
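As an illustration, the average can be computed over a sliding period once the query history has been retrieved. The sketch below assumes each query record is a dictionary with a "finished_at" datetime and a "duration_ms" number; both field names are hypothetical and must be mapped to what the Queries API actually returns.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


def average_response_ms(
    queries: list[dict],
    now: datetime,
    period: timedelta = timedelta(minutes=10),
) -> Optional[float]:
    """Average duration of the queries that finished within the last `period`."""
    recent = [q["duration_ms"] for q in queries if now - period <= q["finished_at"] <= now]
    return mean(recent) if recent else None


def users_see_slowness(avg_ms: Optional[float], threshold_ms: float) -> bool:
    """Apply the rule: the average over the period reaches the threshold."""
    return avg_ms is not None and avg_ms >= threshold_ms
```

When no query ran during the period, the average is undefined and the rule is not triggered.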

Queries Errors

Use the Queries API to check errors in the past SELECT queries.

Rule

If the number of TimeOut errors over a given period exceeds a defined threshold, users are experiencing slowness.
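One possible way to quantify TimeOut errors is to track their share among recent queries. The sketch below assumes each query record is a dictionary whose optional "error" field holds the error text; the field name and the text match on "timeout" are assumptions to adapt to the actual Queries API output.

```python
def timeout_ratio(queries: list[dict]) -> float:
    """Share of queries whose error text mentions a timeout (case-insensitive)."""
    if not queries:
        return 0.0
    timeouts = sum(
        1 for q in queries if "timeout" in str(q.get("error") or "").lower()
    )
    return timeouts / len(queries)
```

The resulting ratio can then be compared with a threshold chosen for the platform, for example a few percent of the queries over the period.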

[Optional] Send Queries on critical tables

Define a list of critical tables, then regularly send a set of queries against them.

Rule

If the average response time over a given period reaches a defined threshold while no write operations are running on those tables, users are experiencing slowness.

[Optional] Send Queries on a dummy table every Y minutes

Objective: send a set of queries that reproduces data-analyst usage.

An example is provided here (*). This example has to be split into 2 parts:

Part 1: Create and load the dummy data. This part has to be done only once.
Part 2: Two SELECT queries to be sent every Y minutes.

Rule

If these recurring queries take longer than their average duration plus a margin, it is recommended to restart the cluster.

Note that if a query's response time exceeds 1 second (default value), its result is cached, so the next executions of the same query respond in 1 or 2 ms.

Checks on dummy tables do not guarantee that the production tables are usable.

(*) This example uses data from the Indexima bench. Refer to this page to get the data and more information.

Off-line Monitoring

Check memory size used for indexes

Use the Indexes & Dictionaries API to check the memory size of indexes.

Administrators may aggregate the results by functional domain and compare each total with the maximum limit agreed for that domain.
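The aggregation by functional domain could look like the sketch below. It assumes the index sizes have already been retrieved as dictionaries with a "table" name and a "size_bytes" value (hypothetical field names), and that the administrator maintains a mapping from table to functional domain and a limit per domain.

```python
from collections import defaultdict


def size_by_domain(indexes: list[dict], domain_of: dict[str, str]) -> dict[str, int]:
    """Sum index sizes per functional domain; unmapped tables go to 'unassigned'."""
    totals: dict[str, int] = defaultdict(int)
    for idx in indexes:
        domain = domain_of.get(idx["table"], "unassigned")
        totals[domain] += idx["size_bytes"]
    return dict(totals)


def domains_over_limit(totals: dict[str, int], limits: dict[str, int]) -> list[str]:
    """Return, in alphabetical order, the domains whose total exceeds their limit."""
    return sorted(d for d, size in totals.items() if d in limits and size > limits[d])
```

Domains without a declared limit are reported in the totals but never flagged, which keeps the check conservative.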

Check queries performance

Use the Get Queries History page to check query performance.