Whole cluster goes down if one node fails to serve the request

Akash_Singh · September 14, 2017, 11:19am

Hi All,

I am using the latest Aerospike (Community 3.14.1.3) with the latest Java client, and I am getting timeout exceptions from healthy nodes even though those nodes are available.

Scenario: I am running a cluster of 4 nodes, and the problem occurs when a single node goes down or fails to serve requests because it has hit its file descriptor limit (proto-fd-max).

Because of the failure of that one node, my whole application goes down and throws socket timeout exceptions against all nodes. Why?

As I understand it, the client should drop the bad node from the cluster and my app should keep working (except for the threads that are trying to connect to the bad node, which need the Java client to refresh its view of the cluster).

Please explain the expected behaviour and let me know if I am doing something wrong.

Albot · September 14, 2017, 10:14pm

Do you have retries defined in your client application’s policy? Why did the node go down? Is it just one application using Aerospike that dies, or all apps?
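To illustrate what a retry setting buys you, here is a minimal, self-contained sketch in plain Java. It does not use the Aerospike client at all: `RetrySketch`, `NodeTimeout`, `callWithRetry` and `simulatedRead` are hypothetical names made up for this example, and the "try the next node in the list" rule is an assumption for illustration, not how the real client picks nodes. In the actual Java client the comparable knobs are, as far as I know, the `maxRetries` and `sleepBetweenRetries` fields on the policy objects.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from the Aerospike client.
class RetrySketch {
    /** Thrown by a simulated node that cannot serve the request. */
    static class NodeTimeout extends RuntimeException {
        NodeTimeout(String node) { super("timeout on " + node); }
    }

    /**
     * Makes one initial attempt plus up to maxRetries further attempts,
     * moving to the next node in the list each time and sleeping between tries.
     */
    static <T> T callWithRetry(List<String> nodes, Function<String, T> request,
                               int maxRetries, long sleepMillis) {
        if (maxRetries < 0 || nodes.isEmpty()) {
            throw new IllegalArgumentException("need at least one node and maxRetries >= 0");
        }
        NodeTimeout last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            String node = nodes.get(attempt % nodes.size());
            try {
                return request.apply(node);
            } catch (NodeTimeout e) {
                last = e; // remember the failure and fall through to the next node
                if (attempt < maxRetries) {
                    try {
                        Thread.sleep(sleepMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw e;
                    }
                }
            }
        }
        throw last; // every attempt timed out
    }

    /** Simulated read: "node1" plays the failed node, all others answer. */
    static String simulatedRead(String node) {
        if (node.equals("node1")) {
            throw new NodeTimeout(node);
        }
        return "value-from-" + node;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node1", "node2", "node3", "node4");
        // With retries allowed, the request survives the failed first node.
        System.out.println(callWithRetry(nodes, RetrySketch::simulatedRead, 2, 10L));
    }
}
```

With `maxRetries` set to 0 the same call fails with the timeout from `node1`, which is roughly the symptom to expect when no retry is configured and a request lands on the bad node.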

Brian · September 16, 2017, 1:42am

If one server node has reached proto-fd-max and all nodes have the same proto-fd-max limit, then it’s likely that all nodes are reaching that same limit.

Also, there is a short time-frame (~1 second) after a node goes down where the cluster blocks writes until a new partition map is agreed upon.

Akash_Singh · September 16, 2017, 8:57am

I reproduced this issue by lowering the fd limit on only one node. In the actual incident it was a hardware failure of one node, and because of this failure our cluster took approximately 45 minutes to recover. Until then our application was getting continuous timeout exceptions whenever it tried to connect to the cluster.

Akash_Singh · September 16, 2017, 9:02am

We were also trying to connect to our cluster with aql, but that was throwing errors as well.

Akash_Singh · September 21, 2017, 5:57am

Albot/Brian, any update?

Brian · September 21, 2017, 5:09pm

Since I know nothing about your application, I would suggest running the Java benchmarks against a test cluster. Kill a node and see what happens. The benchmark programs will get some errors and then recover once the node is removed from the client’s view of the cluster.
