Monitoring Cluster Integrity



Summary

This article points to the specific statistics and server log messages that help identify cluster integrity issues.

Resolution

A healthy cluster maintains its integrity: the cluster size is consistent, and every node is visible to every other node in the cluster. Loss of cluster integrity can affect client reads and writes, depending on which nodes the client is able to reach and on the client policies.

Monitoring server logs

The server logs show the following when a node detects a cluster integrity fault:

Jan 02 2016 05:12:08 GMT: INFO (paxos): (paxos.c::2600) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=fa12c4d457ac40c
Jan 02 2016 05:12:11 GMT: INFO (paxos): (paxos.c::2541) Cluster Integrity Check: Detected succession list discrepancy between node fa12c4d457ac40c and self fa15412447ac40c

Example of logs when a node fails at the fabric level:

Jan 02 2016 05:11:57 GMT: INFO (hb): (hb.c::2395) hb expires but fabric says DEAD: node fa12c4d457ac40c
Jan 02 2016 05:11:58 GMT: INFO (hb): (hb.c::3046) Marking node removal for paxos recovery: fa12c4d457ac40c
Jan 02 2016 05:11:58 GMT: INFO (hb): (hb.c::2574) removing node on heartbeat failure: fa12c4d457ac40c
Jan 02 2016 05:11:59 GMT: INFO (paxos): (paxos.c::1621) removing failed node fa12c4d457ac40c
Jan 02 2016 05:11:59 GMT: INFO (fabric): (fabric.c::1820) fabric: node fa12c4d457ac40c departed
Jan 02 2016 05:11:59 GMT: INFO (fabric): (fabric.c::1730) fabric disconnecting node: fa12c4d457ac40c
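
The messages in the log excerpts above can be picked out of a log file mechanically. The following is a minimal sketch, assuming the message wording matches the excerpts exactly (other server versions may word these lines differently). The function name scan_log and the event labels are hypothetical and are not part of any Aerospike tool.

```python
import re

# Message fragments copied from the log excerpts above. The event labels
# (dictionary keys) are made up for this sketch.
PATTERNS = {
    "integrity_fault": re.compile(r"CLUSTER INTEGRITY FAULT"),
    "succession_discrepancy": re.compile(
        r"succession list discrepancy between node (\w+) and self (\w+)"),
    "heartbeat_removal": re.compile(
        r"removing node on heartbeat failure: (\w+)"),
    "node_departed": re.compile(r"fabric: node (\w+) departed"),
}


def scan_log(lines):
    """Return a list of (event_label, node_ids) tuples, one per matching line."""
    events = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                events.append((label, match.groups()))
                break  # report at most one event per log line
    return events


SAMPLE = [
    "Jan 02 2016 05:12:08 GMT: INFO (paxos): (paxos.c::2600) CLUSTER INTEGRITY FAULT. "
    "[Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=fa12c4d457ac40c",
    "Jan 02 2016 05:12:11 GMT: INFO (paxos): (paxos.c::2541) Cluster Integrity Check: "
    "Detected succession list discrepancy between node fa12c4d457ac40c and self fa15412447ac40c",
    "Jan 02 2016 05:11:58 GMT: INFO (hb): (hb.c::2574) removing node on heartbeat failure: fa12c4d457ac40c",
    "Jan 02 2016 05:11:59 GMT: INFO (fabric): (fabric.c::1820) fabric: node fa12c4d457ac40c departed",
]

if __name__ == "__main__":
    for label, nodes in scan_log(SAMPLE):
        print(label, nodes)
```

In practice the lines would come from the server log file rather than from a hard-coded sample, and a repeated integrity_fault event over time is the signal worth alerting on.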

Monitoring server statistics:

The following statistics should be monitored closely to detect cluster integrity issues:

  1. cluster_size - Size of the cluster. Should be the same on all nodes.

  2. cluster_integrity - Indicates whether the cluster is in a whole and complete state, as far as all of its nodes are concerned. If the value is true, the cluster is intact. If the value is false, there is a cluster integrity fault, and information about it is also logged to the daemon’s log file. The value turns false when a fault occurs and should switch back to true once the fault is mitigated. A value of false should be followed up by checking the server logs for consistent logging of integrity faults. If the logs are clean, the false value may be a spurious alarm.

  3. cluster_key - Randomly generated 64-bit value, shown as a hexadecimal string, that names the last Paxos cluster state agreement. It should be identical on all nodes in the cluster.

In short, if cluster_integrity is true, all nodes reporting the same cluster_key belong to the same cluster, whose size is given by cluster_size.
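
The three checks described above can be combined into a single comparison across nodes. The following is a minimal sketch, assuming each node's statistics have already been collected as a semicolon-separated string of key=value pairs. That input format, the helper names parse_stats and check_cluster, and the node labels are assumptions made for illustration.

```python
def parse_stats(raw):
    """Parse a 'key=value;key=value' string into a dict (assumed input format)."""
    stats = {}
    for pair in raw.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            stats[key] = value
    return stats


def check_cluster(node_stats):
    """Compare statistics across nodes.

    node_stats maps a node label to its parsed statistics dict.
    Returns a list of problem descriptions; an empty list means no
    discrepancy was found among the nodes that were checked.
    """
    problems = []

    # cluster_integrity should be true on every node.
    for node, stats in sorted(node_stats.items()):
        integrity = stats.get("cluster_integrity")
        if integrity != "true":
            problems.append(f"{node}: cluster_integrity is {integrity}")

    # cluster_size and cluster_key should be the same on all nodes.
    for name in ("cluster_size", "cluster_key"):
        values = {str(stats.get(name)) for stats in node_stats.values()}
        if len(values) > 1:
            problems.append(f"{name} differs across nodes: {sorted(values)}")

    return problems


if __name__ == "__main__":
    # Made-up values for two hypothetical nodes.
    nodes = {
        "node-a": parse_stats("cluster_size=3;cluster_key=AB12;cluster_integrity=true"),
        "node-b": parse_stats("cluster_size=2;cluster_key=CD34;cluster_integrity=false"),
    }
    for problem in check_cluster(nodes):
        print(problem)
```

A check like this only sees the nodes it was given, so it should be run against every node that is expected to be in the cluster, not only the nodes that one node reports.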

Using Graphite or Nagios:

Plugins are available for monitoring these statistics and their trends: http://www.aerospike.com/docs/operations/monitor/

Post recovery:

During a loss of cluster integrity, the cluster continues to run, even as a split cluster. Clients can read and write data on the nodes they are able to connect to, so the data remains available. Once the issue is mitigated, conflicting copies of a record are resolved according to the conflict-resolution-policy, which defaults to generation, to decide which copy of the record is retained.
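
The idea behind generation-based resolution can be illustrated with a short sketch. This is not the server's implementation: it assumes that generation is a per-record counter that grows with each write, and the RecordCopy type, the node labels, and the tie-breaking rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecordCopy:
    node: str        # hypothetical label for the node holding this copy
    generation: int  # assumed: per-record counter that grows with each write
    value: str


def resolve_by_generation(a, b):
    """Keep the copy with the higher generation.

    On a tie this sketch keeps `a`; that tie-break is an arbitrary
    choice made here, not documented server behavior.
    """
    return b if b.generation > a.generation else a


if __name__ == "__main__":
    # Two copies of one record that diverged while the cluster was split.
    left = RecordCopy(node="node-a", generation=3, value="written on side A")
    right = RecordCopy(node="node-b", generation=5, value="written on side B")
    print(resolve_by_generation(left, right))
```

Note the consequence: the copy that received more writes during the split is retained, which is not necessarily the copy that was written most recently.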

Recommendations:

  1. Cluster integrity is heavily affected by a flaky or unstable network. It’s recommended to monitor the network closely.

  2. For new clusters or node additions, always double-check the configuration to avoid creating unnecessary cluster integrity issues.

  3. Upgrading to version 3.7.0.1 or later is recommended. These versions include improvements to the cluster recovery policies and the Paxos algorithm.

