AerospikeException$Timeout while node is stopped

I run an Aerospike v3.4 cluster with two nodes. The cluster contains only 19 records (this is just a test instance). Using the Java Client API, I wrote a small application that periodically queries all of this data:

Statement statement = new Statement();
statement.setNamespace("testNs");
statement.setSetName("testSet");
RecordSet rs = client.query(null, statement);
while (rs.next()) {} ...

This usually works fine, but if I stop one of the nodes while the query is running periodically, I get an exception:

Exception in thread "main" com.aerospike.client.AerospikeException$Timeout: Client timeout: timeout=30000 iterations=26 failedNodes=0 failedConns=25
    	at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:131)
    	at com.aerospike.client.query.QueryExecutor$QueryThread.run(QueryExecutor.java:134)

I tried to prevent this exception by setting a higher timeout, more retries and so on in the policy, but I could not find settings that avoid the exception and simply return the query results (maybe much more slowly, but that is still OK).

Is it possible to set the policy so that the queries survive a stopped or failed node?

The cluster is configured with mesh topology and in-memory storage only.

I have also tried running the query with the aql command-line client; in that case I got

Error: (11) AEROSPIKE_ERR_CLUSTER

error messages.

I ran some tests and, as far as I can see, the query operation keeps running until the policy’s timeout expires or all the retries fail. If a node is down because of maintenance or an outage, this is more likely to happen, and the AerospikeException is thrown.

The only solution I see now is to cancel the application’s operation and simply retry the query.

Is there any other way to work around this behaviour and get the correct result from the query/rs.get calls?

If a node goes down while the query is running, the query will fail because there is no retry on queries by default. The client will eject the downed node from its map within a second. After that, queries should work with the remaining nodes.
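
Since the failed query has to be reissued by the application, a small retry helper is one way to do it. The following is only a sketch: the class QueryRetry, the method withRetry and its parameters are made up for illustration and are not part of the Aerospike client. It works here because AerospikeException is an unchecked exception, so it is caught as a RuntimeException; in real code you would probably narrow the catch to AerospikeException.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not part of the Aerospike client): reruns a failed operation. */
public final class QueryRetry {
    private QueryRetry() {}

    /**
     * Calls attempt.get() up to maxAttempts times, sleeping pauseMillis between
     * failed attempts. Returns the first successful result, otherwise rethrows
     * the last exception seen.
     */
    public static <T> T withRetry(Supplier<T> attempt, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.get();
            } catch (RuntimeException e) {
                // Assumption: any unchecked exception is worth retrying.
                last = e;
            }
            if (i < maxAttempts) {
                try {
                    // Give the client time to drop the stopped node from its map.
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException ie) {
                    // Restore the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw last;
    }
}

// Hypothetical usage with the client and statement from the question:
//
// List<Record> rows = QueryRetry.withRetry(() -> {
//     List<Record> out = new ArrayList<>();   // fresh list per attempt
//     RecordSet rs = client.query(null, statement);
//     try {
//         while (rs.next()) out.add(rs.getRecord());
//     } finally {
//         rs.close();
//     }
//     return out;
// }, 3, 1500);
```

Collecting the records into a fresh list inside each attempt means that partial results from a failed attempt are discarded rather than mixed into the next one. A pause of a bit more than a second matches the statement above that the client ejects the stopped node within about a second; the exact values for attempts and pause are guesses to tune.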