0 downtime configuration


#1

by maxulan » Tue Aug 05, 2014 6:19 am

Hello,

I’m testing a 4-node cluster on EC2. My task is to provide 0 downtime for writes in case of a node failure. With the default config, when I stop one node, writes to all the nodes fail until the cluster is reconfigured. That usually takes a few seconds.

Is there any way to get rid of these few seconds of downtime?

Thanks, Max


#2

by devops02 » Tue Aug 05, 2014 2:52 pm

Hi Max,

Thank you for using Aerospike, and welcome! As for whether there is any way to get rid of the few seconds of downtime: you can minimize it through configuration. To do that, go to the heartbeat stanza and change the interval and timeout settings.

The interval controls how often a heartbeat packet is sent, and the timeout is the number of intervals after which the rest of the nodes in the cluster consider a node missing if they haven’t received a heartbeat from it. By default they are set as:

interval 150
timeout 10

The time it takes the cluster to detect that a node is missing is interval x timeout. With the defaults, that is 150 milliseconds (0.150 seconds) x 10 (the number of heartbeat intervals to wait before timing out a node) = 1.5 seconds, which is fairly fast. You can adjust these two values to shorten the downtime, for example by changing them to

interval 80

and

timeout 15

(80 milliseconds x 15 = 1.2 seconds) and seeing if that result meets your needs.
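
As a quick check of the interval x timeout arithmetic, here is a minimal Python sketch. The function name detection_window_ms is made up for illustration and is not part of Aerospike; it only multiplies the two heartbeat settings to give an approximate detection time.

```python
def detection_window_ms(interval_ms: int, timeout_intervals: int) -> int:
    """Approximate time (ms) before peers treat a silent node as missing.

    Hypothetical helper: it only multiplies the heartbeat interval by the
    number of missed intervals allowed before a node is timed out.
    """
    return interval_ms * timeout_intervals


# Compare the default heartbeat settings with the values suggested above.
for label, interval_ms, timeout in [("default", 150, 10), ("suggested", 80, 15)]:
    window = detection_window_ms(interval_ms, timeout)
    print(f"{label}: {interval_ms} ms x {timeout} = {window / 1000:.1f} s")
```

This prints 1.5 s for the defaults and 1.2 s for the suggested values, so the gain from tuning is modest and has to be weighed against heartbeat delays on a variable network.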

One more thing to consider when running in the cloud is network variability. Cloud providers’ network latency is often not consistent over time, which can cause problems with heartbeat packet delivery times. More info can be found on our website: http://www.aerospike.com/docs/operations/configure/network/heartbeat/

Hope this helps!

Jerry


#3

Hi Jerry,

Thank you for that.

I’m using the configuration values recommended for clusters on Amazon

timeout=20

and

interval=150

which gives 3 seconds of downtime.

My concern is about these 3 seconds. Is there any way to eliminate the downtime entirely?

Thanks, Max


#4

We expect “0 downtime” too. I believe it is a matter of client library settings.

I’m using the C client library, running 200K write TPS against a 5-node cluster from 80 threads. I took one node down with “service aerospike stop”, and the client reported 406 failed requests with error code 500 AEROSPIKE_ERR_SERVER. The other requests were OK: there were 200K x 1.5 s = 300K requests during the 1.5 s detection period, and 300K - 406 of them got responses, with an average latency of 1.1 ms and a maximum of 1.8 s (compared with 0.5 ms average and 2 ms maximum without a node down).

It seems those 406 are the requests that had already been sent to the downed node and never got a response. I would expect the client library to re-send them using the new hash map, but it does not.

After looking into the C client API docs, I changed the client settings:
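
To put those numbers in proportion, here is a small Python sketch of the same arithmetic. The variable names are mine, and the figures are just the ones quoted above (200K TPS, 1.5 s detection period, 406 failures).

```python
tps = 200_000             # write requests per second from the test
detection_window_s = 1.5  # time the cluster needs to notice the missing node
failed = 406              # requests that came back with error code 500

# Requests issued while the cluster had not yet noticed the node was gone.
in_window = int(tps * detection_window_s)
succeeded = in_window - failed
failure_rate = failed / in_window

print(in_window, succeeded, f"{failure_rate:.3%}")
```

So roughly 0.135% of the writes issued during the detection period failed; the rest succeeded, only with higher latency.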

as_policies->retry = AS_POLICY_RETRY_ONCE
as_policies->timeout = 4000 (4 s, which is > 1.5 s * 2 = 3 s)

And finally no failed requests were reported: “0 downtime” reached!
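
For anyone who wants to see the idea outside the client library, here is a rough Python sketch of "retry once, within an overall timeout". It is not how the C client implements its retry policy; NodeDownError, write_with_retry and flaky_write are hypothetical names used only to simulate a write that fails once while the cluster is reconfiguring.

```python
import time


class NodeDownError(Exception):
    """Hypothetical stand-in for a write that hit a node that just went down."""


def write_with_retry(write, max_retries=1, timeout_s=4.0, backoff_s=0.0):
    """Call write(); on NodeDownError retry up to max_retries times.

    timeout_s is the overall budget. With a 1.5 s detection period, 4 s
    leaves room for the first attempt plus one retry (2 x 1.5 s = 3 s).
    """
    deadline = time.monotonic() + timeout_s
    attempt = 0
    while True:
        try:
            return write()
        except NodeDownError:
            attempt += 1
            # Give up once retries are exhausted or the budget would be exceeded.
            if attempt > max_retries or time.monotonic() + backoff_s > deadline:
                raise
            time.sleep(backoff_s)


# Simulated write: the first call fails, the second succeeds.
calls = {"n": 0}


def flaky_write():
    calls["n"] += 1
    if calls["n"] == 1:
        raise NodeDownError("node went away")
    return "ok"


result = write_with_retry(flaky_write)
print(result, calls["n"])
```

The point is only that a request caught by the node going down gets a second chance once the cluster has settled, provided the overall timeout is longer than the detection period.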