Whole cluster goes down if one node fails to serve the request


#1

Hi All,

I am using the latest Aerospike Community Edition (3.14.1.3) with the latest Java client, and I am getting timeout exceptions against healthy nodes even though those nodes are available.

Scenario: I am running a 4-node cluster, and the problem occurs when just one node goes down or fails to serve requests because it has hit its file descriptor limit (proto-fd-max).

Because of that single node's failure, my whole application goes down and throws socket timeout exceptions against all nodes. Why?

As I understand it, the client should drop the bad node from its view of the cluster and my app should keep working (except for the threads that are trying to connect to the bad node, which need a refresh from the Java client).

Please explain the expected behaviour and let me know if I am doing something wrong.


#2

Do you have retries defined in your client application’s policy? Why did the node go down? Is it just one application using Aerospike that dies, or all apps?
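
To illustrate what a retry setting buys you, here is a minimal, generic sketch in plain Java. It is not the Aerospike client's own code: `RetrySketch` and `withRetry` are hypothetical names made up for this example, and the simulated timeout stands in for a request that lands on a bad node. The Java client exposes comparable knobs on its policy objects (for example `maxRetries` and `sleepBetweenRetries`), but check the documentation for your client version before relying on exact names or defaults.

```java
import java.util.concurrent.Callable;

/** Generic retry helper. Hypothetical illustration, not part of the Aerospike client. */
public class RetrySketch {

    /**
     * Calls op up to (1 + maxRetries) times, sleeping sleepMillis between attempts.
     * Rethrows the last failure if every attempt fails.
     */
    public static <T> T withRetry(Callable<T> op, int maxRetries, long sleepMillis) throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                // Only sleep if another attempt is still allowed
                if (attempt < maxRetries) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated request that times out twice before succeeding
        final int[] calls = {0};
        String result = withRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new java.net.SocketTimeoutException("simulated timeout");
            }
            return "ok";
        }, 3, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Without any retries, the first simulated timeout would surface to the caller directly. With a few retries and a short sleep, a transient failure can be absorbed, at the cost of extra latency on the failing requests.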


#3

If one server node has reached proto-fd-max and all nodes have the same proto-fd-max limit, then it’s likely that all nodes are reaching that same limit.
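
For reference, this limit lives in the service context of aerospike.conf on each node. The fragment below is only a sketch: the value shown is illustrative, not a recommendation, and the other settings in the stanza are omitted.

```
service {
    # Maximum number of open file descriptors for client connections.
    # Illustrative value; size it for your expected connection count.
    proto-fd-max 15000
}
```

If every node carries the same value and the client load is spread evenly, all nodes will tend to approach the limit at about the same time.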

Also, there is a short window (~1 second) after a node goes down during which the cluster blocks writes until the new partition map is agreed upon.


#4

I replicated this issue by lowering the fd limit on only one node. In the actual scenario it was a hardware failure of one node, and because of that failure our cluster took approximately 45 minutes to recover. Until then our application was getting continuous timeout exceptions whenever it tried to connect to the cluster.


#5

We were also trying to connect to our cluster with aql, but that was throwing errors as well.


#6

Albot/Brian, any update?


#7

Since I know nothing about your application, I would suggest running the Java benchmarks against a test cluster. Kill a node and see what happens. The benchmark programs will get some errors and then recover once the node is removed from the client’s view of the cluster.