Cluster Integrity Check: Detected succession list discrepancy at Google Cloud

Hello,

I am running Aerospike 3.5.14 in Google Cloud on SSDs. The cluster has 6 nodes, all in the same zone.

For reasons I can’t explain, after some time I get the following error messages on one random node:

Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb98be4f00a0142 and self bb9f43af00a0142
Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb993def00a0142 and self bb9f43af00a0142
Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb94f22f00a0142 and self bb9f43af00a0142
Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9dad0f00a0142 and self bb9f43af00a0142
Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb96724f00a0142 and self bb9f43af00a0142
Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9e6ddf00a0142 and self bb9f43af00a0142
Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb9e6ddf00a0142,bb9dad0f00a0142,bb993def00a0142,bb98be4f00a0142,bb96724f00a0142,bb94f22f00a0142

A simple reboot of the node resolves the issue. However, it takes a huge amount of time because the indexes have to be reloaded into memory.
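For what it's worth, the log above prints phase 1 of the suggested fix itself. Below is a sketch of how it could be issued across all nodes with asadm, using the node list from the log; treating undun as the second phase is my assumption, not something I have verified:

asadm -e "asinfo -v 'dun:nodes=bb9e6ddf00a0142,bb9dad0f00a0142,bb993def00a0142,bb98be4f00a0142,bb96724f00a0142,bb94f22f00a0142'"
asadm -e "asinfo -v 'undun:nodes=bb9e6ddf00a0142,bb9dad0f00a0142,bb993def00a0142,bb98be4f00a0142,bb96724f00a0142,bb94f22f00a0142'"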

I have enabled keepalives in the Aerospike configuration, and also enabled TCP keepalives at the sysctl level. I have also slightly tuned the heartbeat interval and timeout in the mesh networking settings: interval 150, timeout 20.
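For reference, here is roughly what the relevant pieces look like. This is a sketch: the seed address and the sysctl keepalive values below are placeholders, not my exact production settings.

network {
    heartbeat {
        mode mesh
        port 3002
        mesh-seed-address-port 10.0.0.1 3002  # placeholder seed address
        interval 150  # ms between heartbeats
        timeout 20    # missed intervals before a node is considered gone (150 ms x 20 = 3 s)
    }
}

# /etc/sysctl.conf additions (placeholder values)
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6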

  1. Are these Cluster Integrity Faults “normal” in Aerospike? They appear rather frequently (roughly once a day on a 50 GB dataset).

  2. If they are not normal, is there any way to resolve the issue, or at least get a hint about what is causing it?

  3. Does it mean that the network really glitched, or could it be something else?

(I have noted that asadm -e "asinfo -v 'dump-paxos:'", asadm -e "asinfo -v 'dump-fabric:'", and asadm -e "asinfo -v 'dump-hb:'" could be useful commands.)

Thank you for your insights,

Update: it happened again on another cluster (5 machines this time):

Jul 08 2015 21:42:33 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb95c6ef00a0142 and self bb9bdc3f00a0142
Jul 08 2015 21:42:33 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb95666f00a0142 and self bb9bdc3f00a0142
Jul 08 2015 21:42:33 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb937aaf00a0142 and self bb9bdc3f00a0142
Jul 08 2015 21:42:33 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb993e0f00a0142 and self bb9bdc3f00a0142
Jul 08 2015 21:42:33 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb94bd7f00a0142 and self bb9bdc3f00a0142
Jul 08 2015 21:42:33 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb993e0f00a0142,bb95c6ef00a0142,bb95666f00a0142,bb94bd7f00a0142,bb937aaf00a0142

During the issue I ran asadm -e "asinfo -v 'dump-paxos:'" ; asadm -e "asinfo -v 'dump-fabric:'" ; asadm -e "asinfo -v 'dump-hb:'" and the answer was "ok" everywhere.
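As far as I can tell, the "ok" is only the client-side acknowledgement; the detailed paxos/fabric/hb state is written to the Aerospike server log on each node, so that is where to look after running the dumps. A sketch, assuming the default log path:

asadm -e "asinfo -v 'dump-hb:'"
sudo tail -n 200 /var/log/aerospike/aerospike.log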

Hi,

A Cluster Integrity Fault happens when nodes can't see each other over the network. That can be caused by bad configuration on the nodes, such as dissimilar namespace configs, mismatched paxos-protocol or heartbeat-protocol versions, or improper interval and timeout settings.

Yes, it also happens when there is a network glitch: if a node stays out of the cluster for more than interval * timeout (with your settings, 150 ms * 20 = 3 seconds), a cluster integrity fault occurs.

Look in the logs for which node went out of the cluster, even briefly, and the time at which it came back in. If the time difference is more than the configured interval * timeout, the network needs to be investigated; in your case that means GCE.
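For example, something like the following on each node can surface when a node dropped out and when it rejoined. This is a sketch: the exact log phrasing varies by version, and the path assumes a default install.

sudo grep -iE 'departed|arrived|integrity' /var/log/aerospike/aerospike.log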