I am using the latest Aerospike (Community 3.14.1.3) with the latest Java client, and I am seeing timeout exceptions against healthy nodes even though those nodes are available.
Scenario:
I am using a cluster of 4 nodes, and the problem occurs when only one node goes down or fails to serve requests because it has hit its file descriptor limit (proto-fd-max).
Because of the failure of that one node, my whole application goes down and throws socket timeout exceptions against all nodes. Why?
As I understand it, the client should drop the bad node from its view of the cluster and my app should still be able
to work (except for those threads that are trying to connect to the bad node and need a refresh from the Java client).
Please explain the expected behaviour and let me know if I am doing something wrong.
If one server node has reached proto-fd-max and all nodes have the same proto-fd-max limit, then it’s likely that all nodes are reaching that same limit.
Also, there is a short window (about 1 second) after a node goes down during which the cluster blocks writes until a new partition map is agreed upon.
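To illustrate riding out a brief window like that, here is a minimal application-level sketch in plain Java. It is not Aerospike API: the class name RetrySketch, the method withRetry, and its parameters are all hypothetical, and the operation passed in stands for whatever client call the application makes. The Java client's policy objects also carry their own retry settings, so a wrapper like this may be unnecessary in practice.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of the Aerospike client.
final class RetrySketch {

    /**
     * Runs op and returns its result. If op throws, sleeps for sleepMillis
     * and tries again, up to maxAttempts attempts in total. If every attempt
     * fails, the exception from the last attempt is rethrown.
     */
    static <T> T withRetry(Callable<T> op, int maxAttempts, long sleepMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                // Only pause if another attempt is still to come.
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last;
    }
}
```

A call would look something like RetrySketch.withRetry(() -> doWrite(), 3, 500L), where doWrite is a placeholder for the application's own write. A fixed sleep is used here for simplicity; it only helps with short interruptions and would not have covered an outage lasting many minutes.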
I replicated this issue by lowering the fd limit on only one node. In the actual scenario it was a hardware failure of one node, and because of that failure our cluster took approximately 45 minutes to recover. Until then, our application was getting continuous timeout exceptions whenever it tried to connect to the cluster.
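For anyone trying to reproduce this, proto-fd-max is set in the service context of aerospike.conf. A fragment along the following lines would lower it on a single node. The value shown is an arbitrary assumption for a test setup, not the one used in the test described above, and the rest of the service block is omitted.

```
service {
    # Maximum number of open file descriptors for client connections.
    # Deliberately low here so that the limit is easy to hit in a test.
    proto-fd-max 1024
}
```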
Since I know nothing about your application, I would suggest running the Java benchmarks against a test cluster. Kill a node and see what happens. The benchmark programs will get some errors and then recover once the node is removed from the client’s view of the cluster.