Setup: a 3-node Aerospike cluster (v3.5.15) collocated with 3 Aerospike Java clients (v3.1.4).
I have encountered a situation where two nodes fail and the surviving client keeps trying to query the failed nodes indefinitely, failing each time with timeout exceptions.
This scenario cannot be easily reproduced. Normally, when a node goes down, the client logs several ‘refresh failed’ messages followed by ‘Remove node’; the partition changes are then applied and the client uses the new DHT:
[2016-01-06 08:41:26,436] [tend] Node BB9023F013E16FA 10.0.42.61:3000 refresh failed: com.aerospike.client.AerospikeException: java.net.SocketException: Connection reset
[2016-01-06 08:41:26,687] [tend] Node BB9023F013E16FA 10.0.42.61:3000 refresh failed: Error Code 11: java.net.ConnectException: Connection refused
[2016-01-06 08:41:26,943] [tend] Node BB9023F013E16FA 10.0.42.61:3000 refresh failed: Error Code 11: java.net.ConnectException: Connection refused
[2016-01-06 08:41:26,943] [tend] Remove node BB9023F013E16FA 10.0.42.61:3000
In this specific situation I constantly see ‘refresh failed’ messages but never a ‘Remove node’ message for either of the failed nodes:
[2015-12-22 16:29:47,151] [tend] Node BB96BF55F3E16FA 10.0.42.61:3000 refresh failed: java.io.EOFException
[2015-12-22 16:29:47,402] [tend] Node BB96BF55F3E16FA 10.0.42.61:3000 refresh failed: Error Code 11: java.net.ConnectException: Connection refused
[2015-12-22 16:29:47,658] [tend] Node BB96BF55F3E16FA 10.0.42.61:3000 refresh failed: Error Code 11: java.net.ConnectException: Connection refused
[2015-12-22 16:29:47,679] [tend] Node BB95C2D043E16FA 10.0.42.63:3000 refresh failed: com.aerospike.client.AerospikeException: java.net.SocketException: Connection reset
at com.aerospike.client.Info.sendCommand(Info.java:468)
at com.aerospike.client.Info.<init>(Info.java:123)
at com.aerospike.client.Info.request(Info.java:408)
at com.aerospike.client.cluster.Node.refresh(Node.java:90)
at com.aerospike.client.cluster.Cluster.tend(Cluster.java:284)
at com.aerospike.client.cluster.Cluster.run(Cluster.java:250)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at com.aerospike.client.cluster.Connection.readFully(Connection.java:89)
at com.aerospike.client.Info.sendCommand(Info.java:459)
... 6 more
[2015-12-22 16:29:47,930] [tend] Node BB96BF55F3E16FA 10.0.42.61:3000 refresh failed: Error Code 11: java.net.ConnectException: Connection refused
[2015-12-22 16:29:47,930] [tend] Node BB95C2D043E16FA 10.0.42.63:3000 refresh failed: Error Code 11: java.net.ConnectException: Connection refused
Later the following message appears:
[2015-12-22 16:29:53,190] [tend] Node BB920E7563E16FA 10.0.42.62:3000 thinks it owns cluster, but client sees 3 nodes.
From that moment on, all queries are still routed to partitions based on the original DHT from before the failure, and they fail with timeouts:
Client timeout: timeout=5000 iterations=1 failedNodes=0 failedConns=1 lastNode=BB96BF55F3E16FA 10.0.42.61:3000
Client timeout: timeout=5000 iterations=1 failedNodes=0 failedConns=1 lastNode=BB95C2D043E16FA 10.0.42.63:3000
The client can’t recover from this situation. The ‘refresh failed’ messages, the ‘client sees 3 nodes’ message and the client timeouts continue indefinitely.
Any idea why this was happening, and what I can do to verify that it will not happen again?
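To check whether a client has ended up in this state, one option is to scan its log for nodes that keep failing refresh without ever being removed. The following is only a rough sketch under the assumption that the log lines look exactly like the ones pasted above. The class TendLogScan and its method stuckNodes are hypothetical names of my own and are not part of the Aerospike client.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper; not part of the Aerospike Java client.
class TendLogScan {
    // Assumed format: "[tend] Node <id> <host:port> refresh failed: ..."
    private static final Pattern REFRESH_FAILED =
        Pattern.compile("\\[tend\\] Node (\\S+) (\\S+) refresh failed");
    // Assumed format: "[tend] Remove node <id> <host:port>"
    private static final Pattern REMOVE_NODE =
        Pattern.compile("\\[tend\\] Remove node (\\S+) (\\S+)");

    // Returns the IDs of nodes whose most recent tend event in the log
    // is a refresh failure rather than a removal.
    static Set<String> stuckNodes(List<String> logLines) {
        Set<String> failing = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher failed = REFRESH_FAILED.matcher(line);
            Matcher removed = REMOVE_NODE.matcher(line);
            if (failed.find()) {
                failing.add(failed.group(1));
            } else if (removed.find()) {
                failing.remove(removed.group(1));
            }
        }
        return failing;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TendLogScan <client-log-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println("Failing refresh but not removed: " + stuckNodes(lines));
    }
}
```

Run against the first log excerpt this reports nothing, because ‘Remove node’ follows the failures. Run against the second excerpt it reports both BB96BF55F3E16FA and BB95C2D043E16FA. It only detects the symptom in the log; it does not explain why the client never removes the nodes.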