Need to restart all nodes when main node fails

evinet · June 1, 2015, 1:06pm

Hi,

We have two different clusters running Aerospike CE 3.5.9. The first is a 3-node cluster and the second is a 4-node cluster. Both run on GCE. Each configuration uses only 6 of the 8 available CPUs, plus local SSDs.

We have had a lot of local-SSD problems on GCE, and a lot of network problems too.

What I’ve seen needs confirmation, but each time one of the nodes is excluded from the cluster and runs “standalone”, if it is the principal ** node, then when it rejoins the cluster all the other nodes become unstable during migration, and in the end we have to restart those nodes every time.

If the failing node is not the principal, there’s no problem: we just restart it and it comes back into the cluster.

Do you have any idea?

Thanks.

Emmanuel

** Note from @Mnemaudsyne: by ‘principal’, the user means ‘main’.

psi · June 1, 2015, 7:44pm

Hello. We have been improving our GCE support. The next release will include a change that improves cluster stability, specifically when non-default GCE firewall rules are being used and the cluster sees “idle” time > 10 min. when there is no transaction load. Do either of these apply to your case? If so, the work-around until the new release is out is to add firewall rules permitting the TCP ephemeral port range (i.e., 32768 - 61000 by default.)

If not, then it would be helpful for us to receive your Aerospike server logs from before and after the event.

Thanks.
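
For anyone applying the work-around above, here is a minimal sketch (not an official Aerospike or GCE tool) for checking that a node's ephemeral port range actually falls inside the range the firewall rule permits. It assumes a Linux host, where the kernel exposes the range in `/proc/sys/net/ipv4/ip_local_port_range`; the helper names `parse_port_range` and `range_is_covered` are made up for illustration, and the allowed range is simply the 32768 - 61000 default quoted in the post.

```python
from pathlib import Path

# Assumption: Linux host. The kernel exposes the ephemeral (local) port
# range in this file as two whitespace-separated integers: "low high".
PORT_RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")


def parse_port_range(text):
    """Parse 'low high' text into a (low, high) tuple of ints."""
    low, high = text.split()
    return int(low), int(high)


def range_is_covered(ephemeral, allowed):
    """Return True if the allowed range contains the whole ephemeral range."""
    return allowed[0] <= ephemeral[0] and ephemeral[1] <= allowed[1]


if __name__ == "__main__":
    # Range quoted in the work-around; adjust to match your firewall rule.
    allowed = (32768, 61000)
    if PORT_RANGE_FILE.exists():
        ephemeral = parse_port_range(PORT_RANGE_FILE.read_text())
        print("ephemeral range on this node:", ephemeral)
        print("covered by firewall rule:", range_is_covered(ephemeral, allowed))
    else:
        print("ip_local_port_range not found; is this a Linux host?")
```

If the script reports that the range is not covered, the node has been configured with a non-default ephemeral range, and the firewall rule would need to permit that range instead.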

psi · June 6, 2015, 12:42am

FYI, the change I mentioned is now released in Aerospike Server Community Edition 3.5.12.

evinet · June 6, 2015, 8:37am

Thanks a lot. We’ll install it on Monday on our nodes.

Emmanuel
