Node does not reconnect to cluster after restart

I ran into a problem where, after restarting a node, it started a cluster of its own and did not join the rest of the cluster. The logs complained about a difference in the succession list, but usually when that happens you can at least see the other nodes in the cluster, whereas this time asmonitor/asadm on that node only showed the node itself:

asadm -e "asinfo -v services"
i (10.111.139.74) returned:
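(An empty services response means the node currently sees no peers.) As a rough cross-check, comparing services with services-alumni on the isolated node and on a healthy one shows whether the node has seen its peers at all since it restarted; this assumes the server version supports the services-alumni info command:

asadm -e "asinfo -v services"           # peers the node sees right now
asadm -e "asinfo -v services-alumni"    # peers the node has ever seen since startup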

and here’s some log output:

Nov 05 2015 17:14:21 GMT: INFO (demarshal): (thr_demarshal.c::283) Service started: socket 3000
Nov 05 2015 17:14:21 GMT: INFO (hb): (hb.c::2388) new heartbeat received: bb960cb7f0b0022 principal node is bb9f685900b0022
Nov 05 2015 17:14:21 GMT: INFO (fabric): (fabric.c::1818) fabric: node bb960cb7f0b0022 arrived
Nov 05 2015 17:14:21 GMT: INFO (paxos): (paxos.c::2151) Skip node arrival bb960cb7f0b0022 cluster principal bb94a8b6f0a0022 pulse principal bb9f685900b0022
Nov 05 2015 17:14:22 GMT: INFO (demarshal): (thr_demarshal.c::255) Saved original JEMalloc arena #12 for thr_demarshal()
Nov 05 2015 17:14:22 GMT: INFO (demarshal): (thr_demarshal.c::255) Saved original JEMalloc arena #13 for thr_demarshal()
Nov 05 2015 17:14:22 GMT: INFO (demarshal): (thr_demarshal.c::255) Saved original JEMalloc arena #14 for thr_demarshal()
Nov 05 2015 17:14:22 GMT: INFO (demarshal): (thr_demarshal.c::255) Saved original JEMalloc arena #15 for thr_demarshal()
Nov 05 2015 17:14:22 GMT: INFO (demarshal): (thr_demarshal.c::255) Saved original JEMalloc arena #16 for thr_demarshal()
Nov 05 2015 17:14:22 GMT: INFO (demarshal): (thr_demarshal.c::255) Saved original JEMalloc arena #17 for thr_demarshal()
Nov 05 2015 17:14:22 GMT: INFO (demarshal): (thr_demarshal.c::255) Saved original JEMalloc arena #18 for thr_demarshal()
Nov 05 2015 17:14:23 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.171.116.251:3002: timed out
Nov 05 2015 17:14:23 GMT: INFO (demarshal): (thr_demarshal.c::792) Waiting to spawn demarshal threads ...
Nov 05 2015 17:14:23 GMT: INFO (demarshal): (thr_demarshal.c::795) Started 8 Demarshal Threads
Nov 05 2015 17:14:23 GMT: INFO (as): (as.c::450) service ready: soon there will be cake!
Nov 05 2015 17:14:24 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.237.255.3:3002: timed out
Nov 05 2015 17:14:24 GMT: INFO (hb): (hb.c::1055) initiated connection to mesh host at 10.13.162.113:3002 (10.13.162.113:3002) via socket 73 from 10.111.139.74:41415
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9b3057a0b0022 and self bb94a8b6f0a0022
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9ad8b6f0a0022 and self bb94a8b6f0a0022
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9e305830b0022 and self bb94a8b6f0a0022
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb960cb7f0b0022 and self bb94a8b6f0a0022
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9fea2100a0022 and self bb94a8b6f0a0022
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb90f88440b0022 and self bb94a8b6f0a0022
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb98a03900b0022 and self bb94a8b6f0a0022
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb9f685900b0022,bb9f305830b0022,bb9ee49930b0022,bb9e8429a0b0022,bb9e305830b0022,bb9cea2100a0022,bb9c8429a0b0022,bb9b9849a0b0022,bb9b3057a0b0022,bb9ad8b6f0a0022,bb9938a6f0a0022,bb98f849a0b0022,bb98a03900b0022,bb98703900b0022,bb960cb7f0b0022,bb960805d0a0022,bb926897b0a0022,bb90f88440b0022,bb9fea2100a0022

Nov 05 2015 17:14:51 GMT: INFO (hb): (hb.c::2319) HB node bb9b3057a0b0022 in different cluster - succession lists don't match
Nov 05 2015 17:14:51 GMT: INFO (hb): (hb.c::2319) HB node bb9ad8b6f0a0022 in different cluster - succession lists don't match
Nov 05 2015 17:14:51 GMT: INFO (hb): (hb.c::2319) HB node bb9fea2100a0022 in different cluster - succession lists don't match
Nov 05 2015 17:14:51 GMT: INFO (hb): (hb.c::2319) HB node bb90f88440b0022 in different cluster - succession lists don't match
Nov 05 2015 17:14:51 GMT: INFO (hb): (hb.c::2319) HB node bb98a03900b0022 in different cluster - succession lists don't match

I did try a dun all / undun all cycle several times, but it didn’t help.
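Roughly, that cycle looks like this (the dun:nodes= form is the one suggested by the integrity-fault message above; <node-id-list> stands for the node IDs it printed, and the exact syntax may differ between tool/server versions):

# on every node in the cluster, dun the nodes listed in the fault message
asadm -e "asinfo -v 'dun:nodes=<node-id-list>'"
# wait a few seconds for heartbeats to settle, then undun the same list
asadm -e "asinfo -v 'undun:nodes=<node-id-list>'"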

Restarting a node in that state seems to fix it, but it has happened intermittently on several of my cluster nodes, which is frustrating because a (cold) restart takes a very long time. Any idea how to get out of a situation like that?
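For reference, one way to confirm whether a node actually rejoined after a restart is to check cluster_size on it (a sketch assuming asinfo is run locally on the node; -l just splits the statistics output into one item per line):

asinfo -v statistics -l | grep cluster_size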

Were any other nodes restarting at that time? Is it possible that high network traffic was creating issues when the node attempted to reconnect?
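For example, you could verify that the mesh heartbeat port is reachable from the restarted node while the problem is happening (port 3002 and the 10.13.162.113 address are taken from your log; nc is just one way to test TCP connectivity):

nc -vz 10.13.162.113 3002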

@Ben_Bates,

It was the only node restarted at that time, and no, the network was not particularly loaded (regular traffic).

@Ben_Bates,

This just happened again. The affected node (BB9F685900B0022) reports a cluster size of 1 with itself as principal, while the rest of the nodes report a cluster of 17:

ip-10-61-177-58.ec2.internal:3000      3.5.15        17        false     81     53    (386,2)   BB9E8429A0B0022   BB9FEA2100A0022   425,203,571     41
ip-10-69-174-74.ec2.internal:3000      3.5.15         1        false     81     51      (0,0)   BB9F685900B0022   BB9F685900B0022   435,963,313     55
ip-10-69-76-209.ec2.internal:3000      3.5.15        17        false     82     55    (128,0)   BB9E305830B0022   BB9FEA2100A0022   405,450,556     44
ip-10-93-128-96.ec2.internal:3000      3.5.15        17        false     81     52    (262,0)   BB960805D0A0022   BB9FEA2100A0022   434,174,052     40
...

I also tried to tip the node to another one, as well as to dun/undun it, but that didn’t help.
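A tip looks roughly like this (tip: points the node’s mesh heartbeat at another node’s heartbeat address; the host here is just the mesh seed from the earlier log, and the syntax may vary slightly by version):

asadm -e "asinfo -v 'tip:host=10.13.162.113;port=3002'"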

Can you attach the Aerospike logs from a couple of the problematic nodes to this thread and indicate the date and time when it happened?
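Something along these lines would capture the relevant window (a sketch assuming the default log location /var/log/aerospike/aerospike.log; adjust the path and the timestamp prefix to match your setup):

grep '<Mon DD YYYY HH:MM>' /var/log/aerospike/aerospike.log > aerospike-incident.log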

How did you resolve this in the end? I presume the cluster has reformed?

@Ben_Bates

A subsequent restart of the node usually fixes it - that’s what I did this time too. I’ll save the logs next time it happens.