Cluster join issue

I have a 3-node cluster. One node failed to join cluster despite the primary successfully received and accepted the join request from failing node.

Any idea why ?

Hi @lutfik , welcome to the community! Without more information, it would be difficult to diagnose a problem here. I would suggest starting by looking at the logs to check for error messages as well as checking out our troubleshooting documents.

Hi @aanderson,

Here’s a clipped log from node failed to join cluster:

Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (info): (ticker.c:169) NODE-ID bb9deda2f3e16fa CLUSTER-SIZE 0
Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (info): (ticker.c:251)    cluster-clock: skew-ms 0
Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (info): (ticker.c:274)    system: total-cpu-pct 1 user-cpu-pct 1 kernel-cpu-pct 0 free-mem-kbytes 127380740 free-mem-pct 96
Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (info): (ticker.c:297)    process: cpu-pct 1 threads (17,81,58,58) heap-kbytes (2163416,2164176,2222080) heap-efficiency-pct 97.4
Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (info): (ticker.c:308)    in-progress: info-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0
Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (info): (ticker.c:331)    fds: proto (0,1,1) heartbeat (0,0,0) fabric (24,24,0)
Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (info): (ticker.c:339)    heartbeat-received: self 35705 foreign 71446
Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (info): (ticker.c:369)    fabric-bytes-per-second: bulk (0,0) ctrl (248,31) meta (0,0) rw (0,0)
Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (info): (ticker.c:432) {mnemonic} objects: all 0 master 0 prole 0 non-replica 0
Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (info): (ticker.c:494) {mnemonic} migrations: complete
Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (info): (ticker.c:513) {mnemonic} memory-usage: total-bytes 0 index-bytes 0 set-index-bytes 0 sindex-bytes 0 data-bytes 0 used-pct 0.00
Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (info): (ticker.c:432) {knox} objects: all 0 master 0 prole 0 non-replica 0
Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (info): (ticker.c:494) {knox} migrations: complete
Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (info): (ticker.c:513) {knox} memory-usage: total-bytes 409856 index-bytes 0 set-index-bytes 0 sindex-bytes 409856 data-bytes 0 used-pct 0.00
Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (info): (ticker.c:577) {knox} device-usage: used-bytes 0 avail-pct 99
Feb 18 01:17:07 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:07 GMT: INFO (clustering): (clustering.c:5988) sent cluster join request to bb9de33e53e16fa
Feb 18 01:17:08 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:08 GMT: INFO (clustering): (clustering.c:5988) sent cluster join request to bb9de33e53e16fa
Feb 18 01:17:09 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:09 GMT: INFO (clustering): (clustering.c:5988) sent cluster join request to bb9de33e53e16fa
Feb 18 01:17:10 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:10 GMT: INFO (clustering): (clustering.c:5988) sent cluster join request to bb9de33e53e16fa
Feb 18 01:17:11 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:11 GMT: INFO (clustering): (clustering.c:5988) sent cluster join request to bb9de33e53e16fa
Feb 18 01:17:12 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:12 GMT: INFO (clustering): (clustering.c:5988) sent cluster join request to bb9de33e53e16fa
Feb 18 01:17:13 sf-notif-aerospike02.novalocal asd[26696]: Feb 17 2022 18:17:13 GMT: INFO (clustering): (clustering.c:5988) sent cluster join request to bb9de33e53e16fa

And, here’s from primary node that already accepted the join request:

Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:169) NODE-ID bb9de33e53e16fa CLUSTER-SIZE 2
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:251)    cluster-clock: skew-ms 1
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:274)    system: total-cpu-pct 113 user-cpu-pct 69 kernel-cpu-pct 44 free-mem-kbytes 107537592 free-mem-pct 81
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:297)    process: cpu-pct 133 threads (17,81,62,58) heap-kbytes (21466337,22203384,22956032) heap-efficiency-pct 93.5
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:308)    in-progress: info-q 0 rw-hash 3 proxy-hash 0 tree-gc-q 0
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:331)    fds: proto (132,2522852,2522720) heartbeat (0,0,0) fabric (36,138,102)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:339)    heartbeat-received: self 1169521 foreign 1168291
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:369)    fabric-bytes-per-second: bulk (0,0) ctrl (34,241) meta (0,0) rw (1559157,1573847)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:401)    early-fail: demarshal 0 tsvc-client 4 tsvc-from-proxy 0 tsvc-batch-sub 0 tsvc-from-proxy-batch-sub 0 tsvc-udf-sub 0 tsvc-ops-sub 0
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:432) {mnemonic} objects: all 0 master 0 prole 0 non-replica 0
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:494) {mnemonic} migrations: complete
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:513) {mnemonic} memory-usage: total-bytes 0 index-bytes 0 set-index-bytes 0 sindex-bytes 0 data-bytes 0 used-pct 0.00
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:432) {knox} objects: all 39810999 master 19606650 prole 20204349 non-replica 0
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:494) {knox} migrations: complete
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:513) {knox} memory-usage: total-bytes 16678147631 index-bytes 2547903936 set-index-bytes 0 sindex-bytes 7841576586 data-bytes 6288667109 used-pct 48.54
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:577) {knox} device-usage: used-bytes 7805781424 avail-pct 95
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:649) {knox} client: tsvc (0,0) proxy (1636,0,0) read (2716275,0,0,56028797,0) write (566417121,9972,35,0) delete (0,0,0,0,0) udf (0,0,0,0) lang (0,0,0,0)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:720) {knox} from-proxy: tsvc (0,0) read (1,0,0,20,0) write (339,0,0,0) delete (0,0,0,0,0) udf (0,0,0,0) lang (0,0,0,0)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:826) {knox} scan: basic (0,11,0) aggr (0,0,0) udf-bg (0,0,0) ops-bg (0,0,0)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:854) {knox} query: basic (204282301,0) aggr (0,0) udf-bg (0,0) ops-bg (0,0)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:923) {knox} dup-res: ask 2672356 respond (104,1234530)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (ticker.c:964) {knox} retransmits: migration 0 all-read 0 all-write (0,47) all-delete (0,0) all-udf (0,0) all-batch-sub 0 udf-sub (0,0) ops-sub (0,0)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (hist.c:321) histogram dump: {knox}-read (58745072 total) msec
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (hist.c:331)  (00: 0058744258) (01: 0000000171) (02: 0000000171) (03: 0000000213)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (hist.c:331)  (04: 0000000216) (05: 0000000038) (06: 0000000004) (07: 0000000001)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (hist.c:321) histogram dump: {knox}-write (566427093 total) msec
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (hist.c:331)  (00: 0527084952) (01: 0024364956) (02: 0008168687) (03: 0003140520)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (hist.c:331)  (04: 0001229898) (05: 0001867514) (06: 0000543879) (07: 0000026092)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (hist.c:331)  (08: 0000000515) (09: 0000000030) (10: 0000000046) (11: 0000000004)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (hist.c:321) histogram dump: {knox}-query (204282301 total) msec
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (hist.c:331)  (00: 0204265798) (01: 0000009006) (02: 0000003895) (03: 0000001991)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (hist.c:331)  (04: 0000001345) (05: 0000000252) (06: 0000000012) (07: 0000000002)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (hist.c:321) histogram dump: {knox}-query-rec-count (41484184 total) count
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (info): (hist.c:331)  (01: 0040686913) (02: 0000283815) (03: 0000513426) (10: 0000000030)
Feb 18 01:17:01 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:01 GMT: INFO (clustering): (clustering.c:7259) accepted join request from node bb9deda2f3e16fa
Feb 18 01:17:02 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:02 GMT: INFO (clustering): (clustering.c:7259) accepted join request from node bb9deda2f3e16fa
Feb 18 01:17:03 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:03 GMT: INFO (clustering): (clustering.c:7259) accepted join request from node bb9deda2f3e16fa
Feb 18 01:17:04 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:04 GMT: INFO (clustering): (clustering.c:7259) accepted join request from node bb9deda2f3e16fa
Feb 18 01:17:05 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:05 GMT: INFO (drv_ssd): (drv_ssd.c:1830) {knox} /data/aerospike/knox07.dat: used-bytes 780132736 free-wblocks 23509 write-q 0 write (49157,0.3) defrag-q 0 defrag-read (48927,0.4) defrag-write (24438,0.2)
Feb 18 01:17:05 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:05 GMT: INFO (drv_ssd): (drv_ssd.c:1830) {knox} /data/aerospike/knox08.dat: used-bytes 780644144 free-wblocks 23510 write-q 0 write (49176,0.4) defrag-q 0 defrag-read (48940,0.4) defrag-write (24446,0.2)
Feb 18 01:17:05 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:05 GMT: INFO (drv_ssd): (drv_ssd.c:1830) {knox} /data/aerospike/knox09.dat: used-bytes 781116608 free-wblocks 23496 write-q 0 write (49240,0.4) defrag-q 0 defrag-read (48987,0.6) defrag-write (24470,0.2)
Feb 18 01:17:05 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:05 GMT: INFO (drv_ssd): (drv_ssd.c:1830) {knox} /data/aerospike/knox05.dat: used-bytes 780215632 free-wblocks 23500 write-q 0 write (49117,0.3) defrag-q 0 defrag-read (48857,0.2) defrag-write (24405,0.2)
Feb 18 01:17:05 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:05 GMT: INFO (drv_ssd): (drv_ssd.c:1830) {knox} /data/aerospike/knox01.dat: used-bytes 780526224 free-wblocks 23502 write-q 0 write (49067,0.4) defrag-q 0 defrag-read (48845,0.4) defrag-write (24391,0.2)
Feb 18 01:17:05 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:05 GMT: INFO (drv_ssd): (drv_ssd.c:1830) {knox} /data/aerospike/knox06.dat: used-bytes 780326112 free-wblocks 23492 write-q 0 write (49225,0.6) defrag-q 0 defrag-read (48967,0.6) defrag-write (24459,0.3)
Feb 18 01:17:05 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:05 GMT: INFO (clustering): (clustering.c:7259) accepted join request from node bb9deda2f3e16fa
Feb 18 01:17:05 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:05 GMT: INFO (drv_ssd): (drv_ssd.c:1830) {knox} /data/aerospike/knox04.dat: used-bytes 780984688 free-wblocks 23494 write-q 0 write (49153,0.4) defrag-q 0 defrag-read (48912,0.6) defrag-write (24429,0.2)
Feb 18 01:17:05 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:05 GMT: INFO (drv_ssd): (drv_ssd.c:1830) {knox} /data/aerospike/knox10.dat: used-bytes 780730608 free-wblocks 23496 write-q 0 write (49202,0.5) defrag-q 0 defrag-read (48957,0.6) defrag-write (24454,0.3)
Feb 18 01:17:05 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:05 GMT: INFO (drv_ssd): (drv_ssd.c:1830) {knox} /data/aerospike/knox03.dat: used-bytes 781070880 free-wblocks 23495 write-q 0 write (49178,0.4) defrag-q 0 defrag-read (48946,0.6) defrag-write (24445,0.2)
Feb 18 01:17:06 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:06 GMT: INFO (drv_ssd): (drv_ssd.c:1830) {knox} /data/aerospike/knox02.dat: used-bytes 780034592 free-wblocks 23509 write-q 0 write (49132,0.5) defrag-q 0 defrag-read (48906,0.7) defrag-write (24428,0.3)
Feb 18 01:17:06 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:06 GMT: INFO (clustering): (clustering.c:7259) accepted join request from node bb9deda2f3e16fa
Feb 18 01:17:07 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:07 GMT: INFO (clustering): (clustering.c:7259) accepted join request from node bb9deda2f3e16fa
Feb 18 01:17:08 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:08 GMT: INFO (clustering): (clustering.c:7259) accepted join request from node bb9deda2f3e16fa
Feb 18 01:17:09 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:09 GMT: INFO (clustering): (clustering.c:7259) accepted join request from node bb9deda2f3e16fa
Feb 18 01:17:10 sf-notif-aerospike03.novalocal asd[11799]: Feb 17 2022 18:17:10 GMT: INFO (clustering): (clustering.c:7259) accepted join request from node bb9deda2f3e16fa

So, from the log seems that the join cluster request already sent, received, and approved. But, the node still not joining to cluster. Could it be timestamp issue ??

It could be a one way network failure where the node trying to join is able to send fine but is not receiving?

Having said that, as I am noticing there are no heartbeat connections heartbeat (0,0,0) on either node, is this a multicast setup for heartbeat? In which case maybe there is some issue on that front? Are all three nodes in the same subnet? Could it be something to do with the multicast-ttl configuration?

I’m using multicast The nodes are on private cloud with private IP 172.x.x.x in same subnet. However, when sending traffic outside, the IPs are NAT-ed to 10.16.x.x. So, in aero config, I have to set the access-address to 10.16.x.x FYI, iptables and firewalld are disabled.

I see… the access-address would only be used by the clients I believe… not sure for the fabric communication (inter node channel to do the exchange of information to allow clustering). But this is likely the issue. Have you tried (just to troubleshoot) using mesh? this would force to set an IP address for ‘seeding’ the nodes and may help troubleshoot further. But I would expect the nodes would have to be able to communicate through the IP address set under the address configuration when it comes to the fabric part… (at least from the top of my head thinking about this now).