Need to restart all nodes when main node fails


#1

Hi,

We have two different clusters running Aerospike CE 3.5.9. The first cluster has 3 nodes and the second has 4. Both run on GCE. Each node uses only 6 of the 8 available CPUs, with local SSDs.

We have lots of local-ssd problems on GCE and lots of network problems too.

What I’ve seen needs confirmation, but each time one of the nodes is excluded from the cluster and runs in “standalone” mode, if it is the principal ** node, then when it rejoins the cluster all the other nodes become unstable during migration, and in the end we have to restart those nodes every time.

If the failing node is not the principal, there is no problem: we just restart it and it comes back into the cluster.

Do you have any idea?

Thanks.

Emmanuel

** Note from @Mnemaudsyne: by ‘principal’, the user means ‘main’.


#2

Hello. We have been improving our GCE support. The next release will include a change that improves cluster stability, specifically when non-default GCE firewall rules are being used and the cluster sees “idle” time > 10 min. when there is no transaction load. Do either of these apply to your case? If so, the work-around until the new release is out is to add firewall rules permitting the TCP ephemeral port range (i.e., 32768 - 61000 by default.)

If not, then it would be helpful for us to receive your Aerospike server logs from before and after the event.
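
For anyone applying the firewall work-around above, here is a rough Python sketch that builds a `gcloud compute firewall-rules create` command for the ephemeral port range. The rule name, network name, and instance tag are hypothetical placeholders, and limiting the rule to traffic between tagged cluster nodes is an assumption on my part, not something stated in this thread. It only prints the command so it can be reviewed before running. Check the actual range on your nodes in `/proc/sys/net/ipv4/ip_local_port_range`, since it may differ from the default quoted above.

```python
import shlex

# Default Linux ephemeral port range quoted in the reply above.
DEFAULT_RANGE = (32768, 61000)


def parse_port_range(text):
    """Parse the contents of /proc/sys/net/ipv4/ip_local_port_range.

    The file holds two whitespace-separated integers: low and high port.
    """
    low, high = (int(field) for field in text.split())
    if not 0 < low <= high <= 65535:
        raise ValueError(f"invalid port range: {low}-{high}")
    return low, high


def firewall_command(rule_name, network, node_tag, port_range=DEFAULT_RANGE):
    """Build the gcloud argument list for a rule allowing the TCP port range.

    rule_name, network and node_tag are placeholders (assumptions);
    substitute the values used in your own GCE project.
    """
    low, high = port_range
    return [
        "gcloud", "compute", "firewall-rules", "create", rule_name,
        "--network", network,
        "--allow", f"tcp:{low}-{high}",
        # Assumption: only allow traffic between instances carrying this tag.
        "--source-tags", node_tag,
        "--target-tags", node_tag,
    ]


if __name__ == "__main__":
    # Hypothetical names; print the command for review instead of running it.
    cmd = firewall_command("aerospike-ephemeral", "default", "aerospike")
    print(shlex.join(cmd))
```

If your nodes use a non-default range, pass the result of `parse_port_range()` on the file contents as `port_range`.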

Thanks.


#3

FYI, the change I mentioned is now released in Aerospike Server Community Edition 3.5.12.


#4

Thanks a lot. We’ll install it on Monday on our nodes.

Emmanuel