Handling node failure on the client

Let’s go over what happens when a node goes down, whether you shut down the asd process or the entire machine goes down.

  1. The other nodes of the cluster continue to send heartbeats to the downed node until those heartbeats fail a specified number of consecutive times. This takes timeout * interval milliseconds; with the defaults of timeout 10 and interval 150ms, that is 10 * 150ms = 1500ms.
  2. A Paxos vote elects a leader, which then determines a new partition map and communicates it to the remaining nodes in the cluster. The length of this process depends on the cluster size and network, but let’s assume it takes around 50ms.
  3. At the next tend interval the client identifies that a new cluster has formed, asks for the IP addresses of the nodes in the cluster, and pulls down the new partition map. Assume this may take up to the TendInterval, which by default is 1000ms. The Java, C, and C# clients also allow the tend interval to be configured; the Python client does not yet, so it is fixed at 1s. You probably don’t want to lower this value below 250ms (see the sketch after this list).
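
Adding the steps up, with the default settings the worst case is roughly 1500ms + ~50ms + up to 1000ms, or about 2.5 seconds before the client has the new partition map. If that window matters, the clients that expose the tend interval let you shorten it. Here is a minimal sketch using the Go client; the import path and field names are assumptions that may vary between client versions, so treat it as illustrative rather than definitive.

```go
package main

import (
	"log"
	"time"

	as "github.com/aerospike/aerospike-client-go"
)

func main() {
	// Shorten the tend interval so the client notices cluster changes
	// (and pulls the new partition map) sooner after a node goes down.
	// The default is 1s; going below roughly 250ms is not recommended.
	policy := as.NewClientPolicy()
	policy.TendInterval = 500 * time.Millisecond

	client, err := as.NewClientWithPolicy(policy, "127.0.0.1", 3000)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	log.Println("connected:", client.IsConnected())
}
```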

Until the client has the new partition map, all reads and writes that are directed at the downed node will experience timeouts. The knowledge base article Understanding Timeout Errors covers this in more detail.

Your application logic needs to decide how to handle the exceptions stemming from these timeouts.

  • You can choose to accept that the operation failed and continue processing.
  • You can use the retry policies, provided by most clients, to have the client wait and retry the operation a few times. In the Go client the base policy object provides the MaxRetries and SleepBetweenRetries attributes. You may want to set the sleep period to roughly the TendInterval and enough retries to cover the recovery window you expect. I suggest only using these policies in the exception handler, since most of your writes will not need this behavior (see the sketch after this list).
  • You can implement a similar wait and retry behavior in your application.
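
As a concrete illustration of the last two options, here is a sketch of the "retry only in the exception handler" approach using the Go client. The writeWithFallback helper is hypothetical: the normal write path uses the default policy, and only when that write fails does it fall back to a policy with MaxRetries and SleepBetweenRetries set. The import path and constructor signatures are assumptions that vary between client versions, and a production version should check that the error really is a timeout before retrying.

```go
package main

import (
	"log"
	"time"

	as "github.com/aerospike/aerospike-client-go"
)

// writeWithFallback (hypothetical helper): write with the default policy,
// and only on failure retry with a policy configured to wait and retry.
func writeWithFallback(client *as.Client, key *as.Key, bins as.BinMap) error {
	// Fast path: default write policy, no extra retry behavior.
	if err := client.Put(nil, key, bins); err == nil {
		return nil
	}

	// Exception-handler path: retry a few times, sleeping roughly one
	// tend interval between attempts so the client has a chance to pick
	// up the new partition map. A real application should confirm the
	// error is a timeout before taking this path.
	retryPolicy := as.NewWritePolicy(0, 0)
	retryPolicy.MaxRetries = 3
	retryPolicy.SleepBetweenRetries = 1 * time.Second

	return client.Put(retryPolicy, key, bins)
}

func main() {
	client, err := as.NewClient("127.0.0.1", 3000)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	key, err := as.NewKey("test", "demo", "user1")
	if err != nil {
		log.Fatal(err)
	}

	if err := writeWithFallback(client, key, as.BinMap{"visits": 1}); err != nil {
		log.Printf("write ultimately failed: %v", err)
	}
}
```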