Hello,
I am running Aerospike 3.5.14 in Google Cloud on SSD. My cluster has 6 nodes, all in the same zone.
For reasons I can’t explain, after some time I get the following error messages on one random node:
Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb98be4f00a0142 and self bb9f43af00a0142
Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb993def00a0142 and self bb9f43af00a0142
Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb94f22f00a0142 and self bb9f43af00a0142
Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9dad0f00a0142 and self bb9f43af00a0142
Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb96724f00a0142 and self bb9f43af00a0142
Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9e6ddf00a0142 and self bb9f43af00a0142
Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes: dun:nodes=bb9e6ddf00a0142,bb9dad0f00a0142,bb993def00a0142,bb98be4f00a0142,bb96724f00a0142,bb94f22f00a0142
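To see whether the same peers keep showing up across occurrences, the lines above can be tallied with a short script. This is only a minimal sketch in Python: the regular expression is an assumption based solely on the line format quoted here, and `count_discrepancies` is a hypothetical helper name, not part of any Aerospike tool.

```python
import re
from collections import Counter

# Pattern assumed from the log lines quoted above; other server
# versions may word this message differently.
DISCREPANCY_RE = re.compile(
    r"succession list discrepancy between node (\w+) and self (\w+)"
)


def count_discrepancies(lines):
    """Count (peer, self) node ID pairs reported by the integrity check."""
    counts = Counter()
    for line in lines:
        match = DISCREPANCY_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2367) "
        "Cluster Integrity Check: Detected succession list discrepancy "
        "between node bb98be4f00a0142 and self bb9f43af00a0142",
        "Jul 08 2015 20:24:16 GMT: INFO (paxos): (paxos.c::2367) "
        "Cluster Integrity Check: Detected succession list discrepancy "
        "between node bb993def00a0142 and self bb9f43af00a0142",
    ]
    for (peer, me), n in count_discrepancies(sample).items():
        print(f"self={me} peer={peer} count={n}")
```

Feeding it the full log from each node (for example via `open(path)`) would show whether one peer is consistently the odd one out or whether every peer is reported equally, as in the excerpt above.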
A simple restart of the node resolves the issue. However, this takes a huge amount of time because the indexes have to be reloaded into memory.
I have enabled keepalives in the Aerospike configuration and also enabled TCP keepalives at the sysctl level. In addition, I have slightly tuned the heartbeat interval and timeout in the mesh networking settings: interval 150 timeout 20
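For context, here is how long a peer can stay silent under these settings before it is declared dead. This is a rough sketch that assumes `interval` is in milliseconds and `timeout` is the number of consecutive missed heartbeats; `heartbeat_window_ms` is a hypothetical helper used only for the arithmetic.

```python
def heartbeat_window_ms(interval_ms: int, timeout_count: int) -> int:
    """Silence tolerated before a peer is dropped, in milliseconds.

    Assumes the timeout setting counts missed heartbeat intervals.
    """
    return interval_ms * timeout_count


if __name__ == "__main__":
    # Values from the mesh heartbeat settings quoted above.
    print(heartbeat_window_ms(150, 20))  # 3000
```

Under that assumption, any network stall or node pause longer than about 3 seconds would be enough for the other nodes to drop a peer from the cluster view.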
- Are these Cluster Integrity Faults “normal” in Aerospike? They appear rather frequently (approximately once a day on a 50 GB dataset).
- If they are not normal, is there any way to resolve the issue or get a hint about what is causing it?
- Does it mean that the network really glitched, or could it be something else?
(I have noted that the following could be useful commands: asadm -e "asinfo -v 'dump-paxos:'" ; asadm -e "asinfo -v 'dump-fabric:'" ; asadm -e "asinfo -v 'dump-hb:'")
Thank you for your insights,