Clique based evicted nodes while updating preferred principal



Detail

What is meant by the following log message “clique based evicted nodes while updating preferred principal”?

Answer

Aerospike Server versions 3.13 and newer include clustering improvements, among them a clique-based algorithm that reduces cluster formation time.

The message “clique based evicted nodes while updating preferred principal” is logged when nodes perform a periodic check of connectivity to the principal and discover that one or more nodes cannot reach every other node in the cluster (that is, the cluster no longer forms a full clique).

   May 02 2018 17:17:17 GMT: INFO (clustering): (clustering.c:7834) clique based evicted nodes while updating preferred principal: bb9b700800a0142 
   May 02 2018 17:17:24 GMT: INFO (clustering): (clustering.c:5540) applied new cluster key 2986399e9565
   May 02 2018 17:17:24 GMT: INFO (clustering): (clustering.c:7834) applied new succession list bb9fd01800a0142 bb9f901800a0142 bb9f801800a0142 bb9f601800a0142
   May 02 2018 17:17:24 GMT: INFO (clustering): (clustering.c:5544) applied cluster size 4

The log output indicates the cluster recognizes a change with one or more nodes. Node bb9b700800a0142 is having connectivity issues, possibly as a result of network flakiness, and is removed from the cluster. The cluster reforms with a new succession list that excludes the evicted node, triggering the generation of new partition information under a new cluster key. The cluster size decreases from 5 to 4.
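Conceptually, the check treats cluster connectivity as a graph and keeps only the largest fully connected subgroup (a clique), evicting nodes that fall outside it. The following toy sketch illustrates the idea; all names here are hypothetical, and the actual server logic in clustering.c is considerably more involved:

```python
from itertools import combinations

def largest_clique(nodes, adjacency):
    """Brute-force search for the largest clique; fine for small clusters."""
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            # A clique requires every pair in the subset to be connected.
            if all(b in adjacency[a] for a, b in combinations(subset, 2)):
                return set(subset)
    return set()

# Toy connectivity view: node "E" has lost its links to everyone else.
nodes = ["A", "B", "C", "D", "E"]
adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
    "E": set(),
}

clique = largest_clique(nodes, adjacency)
evicted = set(nodes) - clique
print(sorted(evicted))  # ['E']
```

In this toy view, "E" plays the role of node bb9b700800a0142 above: it is outside the largest clique, so it is evicted and the remaining four nodes reform the cluster.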

   May 02 2018 17:17:31 GMT: INFO (clustering): (clustering.c:7834) clique based evicted nodes while updating preferred principal: bb9b700800a0142
   May 02 2018 17:17:38 GMT: INFO (clustering): (clustering.c:5540) applied new cluster key 9a481c771641
   May 02 2018 17:17:38 GMT: INFO (clustering): (clustering.c:7834) applied new succession list bb9fd01800a0142 bb9f901800a0142 bb9f801800a0142 bb9f601800a0142 bb9b700800a0142
   May 02 2018 17:17:38 GMT: INFO (clustering): (clustering.c:5544) applied cluster size 5

Once the node’s connectivity is restored to a good state, node bb9b700800a0142 is added back to the cluster: the succession list again includes it and the cluster size returns to 5.

To verify this, review the principal’s aerospike.log: the principal will log that node bb9b700800a0142 was removed, indicated by the message “clique based evicted nodes at quantum start:”. In addition, at least one other node’s aerospike.log should show node bb9b700800a0142 departing around the time it was removed from the cluster.
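Such eviction events can be pulled out of a log file with a small script. The following is an illustrative sketch, not an official tool; the regular expression and helper function are assumptions based on the log lines shown above:

```python
import re

# Match both eviction message variants and capture the timestamp
# and the space-separated list of evicted node IDs.
PATTERN = re.compile(
    r"^(?P<ts>\w+ \d+ \d+ [\d:]+ \w+): .*"
    r"clique based evicted nodes "
    r"(?:while updating preferred principal|at quantum start):\s*"
    r"(?P<nodes>[0-9a-f ]+)$"
)

def evictions(lines):
    """Return (timestamp, [node_ids]) for each eviction message found."""
    events = []
    for line in lines:
        m = PATTERN.search(line.strip())
        if m:
            events.append((m.group("ts"), m.group("nodes").split()))
    return events

sample = [
    "May 02 2018 17:17:17 GMT: INFO (clustering): (clustering.c:7834) "
    "clique based evicted nodes while updating preferred principal: bb9b700800a0142",
]
print(evictions(sample))  # [('May 02 2018 17:17:17 GMT', ['bb9b700800a0142'])]
```

Running this across the logs of all nodes makes it easy to correlate the eviction on the principal with the corresponding departure messages elsewhere.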

Notes

It is expected that clients will time out or see higher retries against the removed node until they retrieve an updated partition map, especially for write transactions, which typically do not fall back to a replica. With default settings, it can take the cluster up to 1.5 seconds to detect that a node has left (based on the configured heartbeat interval and timeout), and another 1 to 2 seconds for the cluster to reform; clients by default refresh (tend) the partition map every 1 second. It could therefore take around 4 to 5 seconds before clients start issuing transactions that were previously targeted at the removed node against the new owner(s) of the partitions it owned.
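The timing above can be sketched as a back-of-envelope calculation, assuming common defaults of a 150 ms heartbeat interval, a timeout of 10 missed intervals, and a 1 second client tend interval (verify these against your own configuration):

```python
# Assumed defaults -- check your actual server and client configuration.
heartbeat_interval_s = 0.150       # heartbeat 'interval' (150 ms)
heartbeat_timeout_intervals = 10   # heartbeat 'timeout' (missed intervals)
reform_s = 2.0                     # approximate time for the cluster to reform
client_tend_interval_s = 1.0       # client partition-map refresh cadence

# Time for the cluster to detect that a node has left.
detection_s = heartbeat_interval_s * heartbeat_timeout_intervals  # 1.5 s

# Worst case before a client stops targeting the departed node.
worst_case_s = detection_s + reform_s + client_tend_interval_s

print(f"detection ~{detection_s}s, client worst case ~{worst_case_s}s")
```

This lands in the 4 to 5 second range described above; shorter heartbeat settings reduce detection time at the cost of more sensitivity to transient network issues.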

When there is a node change, the cluster initiates migrations to rebalance partitions across the available nodes. Migrations consume various system resources. This rebalancing mechanism ensures that query volume distributes evenly across all cluster nodes and remains resilient during node failures. Refer to the Automatic Rebalancing write up for details.

Depending on the repartitioning, you may or may not see a latency spike. Any such latency would be expected to last only a couple of seconds.

Keywords

NETWORK CLUSTER CLIQUE

Timestamp

05/02/18