Handling node failure on client


#1

I’ve got simple 3-node cluster with in-memory namespace (replication-factor 2). Aerospike community edition 3.8.1

My application (Go) makes queries against this cluster at a some rate. When i gracefully shut down Aerospike daemon on one of the nodes the Aerospike client produces many errors and transactions are lost. The errors are EOF, Timeout, Retry count and like.

But docs states that client libraries should handle this situation gracefully - transparently retry requests on other nodes. Am i doing something wrong? How to handle this situation correctly.

I’ve tried Python client - the same.


#2

Let’s go over what happens when a node is down, whether you shut down asd, or the entire machine goes down.

  1. The other nodes of the cluster will continue to send heartbeats to the downed node, until those fail a specified number of consecutive times. This takes timeout * interval milliseconds. By default these are set to take 10 * 150ms = 1500ms.
  2. A Paxos vote elects a leader, then a new partition map is determined by it and communicated to all the remaining nodes in the cluster. The length of this process depends on the cluster size and network, but let’s assume it’s around 50ms.
  3. At the next tend interval the client will identify that a new cluster has formed, ask for the IP of the nodes in the cluster, and pull down the new partition map. Assume it may take up to TendInterval in milliseconds, which by default is 1000ms. The Java, C, C# clients also allow for the tend interval to be configured. The Python does not yet, so it’s set at 1s. You probably don’t want to lower this value to less than 250ms.

Until the client has the new partition map, all reads and writes that are directed at the downed node will experience timeouts. The knowledge base article Understanding Timeout Errors covers this in more detail.

It is the application’s logic that needs to decide how to handle the exceptions stemming from these timeouts.

  • You can choose to accept that the operation failed and continue processing.
  • You can use the retry policies, provided by most clients, to have the client wait and retry the operation a few times. In the Go client the base policy object provides the attributes MaxRetries and SleepBetweenRetries. You may want to set the sleep period to be around the TendInterval and the number of retries equal to the period of timeout you expect. I suggest to only use these policies in the exception handler. Most of your writes will not need this behavior.
  • You can implement a similar wait and retry behavior in your application.

#3

Thanks! It’s now much clearer for me how it works.

I think i was confused by this FAQ section:

If a node goes down during a read or write, what happens?

It is possible that a node will go down during a read or write operation. The client API will try to connect to the database but not get a response. When this happens, the client API will automatically try the secondary database. From a coding standpoint, you do not need to be aware of this as the API handles additional attempts at communicating with the database.

So it made me believe that i shouldn’t handle node leave events in my app, client API will do it. Maybe you should clarify this section so people don’t get confused.

And thanks for the timeout errors FAQ, it seems that i’ve missed it. I thought that Timeout policy’s attribute governs each retry timeout, and not the whole request - and seen strange results :slight_smile: Now it’s clear.