I ran into a problem where, after restarting a node, it came up in a cluster of its own and did not join the rest of the cluster. In the logs it complained about a discrepancy in the succession list. Usually when that happens you can at least see the other nodes in the cluster, whereas this time asmonitor/asadm on that node showed only the node itself:
asadm -e "asinfo -v services"
i (10.111.139.74) returned:
And here’s some log output:
Nov 05 2015 17:14:21 GMT: INFO (demarshal): (thr_demarshal.c::283) Service started: socket 3000
Nov 05 2015 17:14:21 GMT: INFO (hb): (hb.c::2388) new heartbeat received: bb960cb7f0b0022 principal node is bb9f685900b0022
Nov 05 2015 17:14:21 GMT: INFO (fabric): (fabric.c::1818) fabric: node bb960cb7f0b0022 arrived
Nov 05 2015 17:14:21 GMT: INFO (paxos): (paxos.c::2151) Skip node arrival bb960cb7f0b0022 cluster principal bb94a8b6f0a0022 pulse principal bb9f685900b0022
Nov 05 2015 17:14:22 GMT: INFO (demarshal): (thr_demarshal.c::255) Saved original JEMalloc arena #12 for thr_demarshal()
Nov 05 2015 17:14:22 GMT: INFO (demarshal): (thr_demarshal.c::255) Saved original JEMalloc arena #13 for thr_demarshal()
Nov 05 2015 17:14:22 GMT: INFO (demarshal): (thr_demarshal.c::255) Saved original JEMalloc arena #14 for thr_demarshal()
Nov 05 2015 17:14:22 GMT: INFO (demarshal): (thr_demarshal.c::255) Saved original JEMalloc arena #15 for thr_demarshal()
Nov 05 2015 17:14:22 GMT: INFO (demarshal): (thr_demarshal.c::255) Saved original JEMalloc arena #16 for thr_demarshal()
Nov 05 2015 17:14:22 GMT: INFO (demarshal): (thr_demarshal.c::255) Saved original JEMalloc arena #17 for thr_demarshal()
Nov 05 2015 17:14:22 GMT: INFO (demarshal): (thr_demarshal.c::255) Saved original JEMalloc arena #18 for thr_demarshal()
Nov 05 2015 17:14:23 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.171.116.251:3002: timed out
Nov 05 2015 17:14:23 GMT: INFO (demarshal): (thr_demarshal.c::792) Waiting to spawn demarshal threads ...
Nov 05 2015 17:14:23 GMT: INFO (demarshal): (thr_demarshal.c::795) Started 8 Demarshal Threads
Nov 05 2015 17:14:23 GMT: INFO (as): (as.c::450) service ready: soon there will be cake!
Nov 05 2015 17:14:24 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.237.255.3:3002: timed out
Nov 05 2015 17:14:24 GMT: INFO (hb): (hb.c::1055) initiated connection to mesh host at 10.13.162.113:3002 (10.13.162.113:3002) via socket 73 from 10.111.139.74:41415
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9b3057a0b0022 and self bb94a8b6f0a0022
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9ad8b6f0a0022 and self bb94a8b6f0a0022
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9e305830b0022 and self bb94a8b6f0a0022
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb960cb7f0b0022 and self bb94a8b6f0a0022
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9fea2100a0022 and self bb94a8b6f0a0022
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb90f88440b0022 and self bb94a8b6f0a0022
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb98a03900b0022 and self bb94a8b6f0a0022
Nov 05 2015 17:14:26 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes: dun:nodes=bb9f685900b0022,bb9f305830b0022,bb9ee49930b0022,bb9e8429a0b0022,bb9e305830b0022,bb9cea2100a0022,bb9c8429a0b0022,bb9b9849a0b0022,bb9b3057a0b0022,bb9ad8b6f0a0022,bb9938a6f0a0022,bb98f849a0b0022,bb98a03900b0022,bb98703900b0022,bb960cb7f0b0022,bb960805d0a0022,bb926897b0a0022,bb90f88440b0022,bb9fea2100a0022
…
Nov 05 2015 17:14:51 GMT: INFO (hb): (hb.c::2319) HB node bb9b3057a0b0022 in different cluster - succession lists don't match
Nov 05 2015 17:14:51 GMT: INFO (hb): (hb.c::2319) HB node bb9ad8b6f0a0022 in different cluster - succession lists don't match
Nov 05 2015 17:14:51 GMT: INFO (hb): (hb.c::2319) HB node bb9fea2100a0022 in different cluster - succession lists don't match
Nov 05 2015 17:14:51 GMT: INFO (hb): (hb.c::2319) HB node bb90f88440b0022 in different cluster - succession lists don't match
Nov 05 2015 17:14:51 GMT: INFO (hb): (hb.c::2319) HB node bb98a03900b0022 in different cluster - succession lists don't match
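To see at a glance which nodes the restarted node disagrees with, the log can be summarised with a small script. This is only a rough sketch: it assumes the message wording is exactly as in the excerpt above, and the function name `summarize_integrity_log` and the result keys are made up for illustration, not part of any Aerospike tool.

```python
import re
from collections import Counter

# These patterns assume the exact message wording seen in the log excerpt.
DISCREPANCY_RE = re.compile(
    r"succession list discrepancy between node (\w+) and self (\w+)"
)
DIFFERENT_CLUSTER_RE = re.compile(r"HB node (\w+) in different cluster")
SUGGESTED_FIX_RE = re.compile(
    r"issue this command across all nodes: (dun:nodes=[0-9a-f,]+)"
)


def summarize_integrity_log(lines):
    """Summarise succession-list messages (hypothetical helper)."""
    summary = {
        "self": None,                    # ID of the node that wrote the log
        "discrepancy": Counter(),        # peer ID -> paxos discrepancy count
        "different_cluster": Counter(),  # peer ID -> hb "different cluster" count
        "suggested_fixes": [],           # dun commands suggested by the log itself
    }
    for line in lines:
        match = DISCREPANCY_RE.search(line)
        if match:
            peer, self_id = match.groups()
            summary["self"] = self_id
            summary["discrepancy"][peer] += 1
            continue
        match = DIFFERENT_CLUSTER_RE.search(line)
        if match:
            summary["different_cluster"][match.group(1)] += 1
            continue
        match = SUGGESTED_FIX_RE.search(line)
        if match and match.group(1) not in summary["suggested_fixes"]:
            summary["suggested_fixes"].append(match.group(1))
    return summary


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py < aerospike.log
    result = summarize_integrity_log(sys.stdin)
    print("self:", result["self"])
    for peer, count in sorted(result["discrepancy"].items()):
        print(f"discrepancy with {peer}: {count}")
    for peer, count in sorted(result["different_cluster"].items()):
        print(f"different cluster: {peer}: {count}")
```

Comparing the peers listed here against the node list in the suggested `dun:nodes=...` command can show whether the restarted node sees every other node as being in a different cluster or only some of them.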
I did try to do a dun all/undun all several times but it didn’t help.
Restarting the affected node seems to fix it, but this has happened intermittently on several of my cluster nodes, which is frustrating since a (cold) restart takes a very long time. Any idea how to get out of a situation like that?