WriteTimeout ignored by client.execute


#1

I am having some issues with the timeout specified getting ignored and as a result an AerospikeException due to timeout.

Specifically:

  • Using client version 3.0.32 (have had issues with upgrades in the past)
  • Specifically this occurs using the LargeSet udf, and the add function. The sets consist of unique ID’s. I’m aware that LargeSet is being pushed out in the latest server releases(which is part of why we’re somewhat dubious about moving to it as it is not a painless process)
  • I have followed with the debugger down to SyncCommand.execute(). At this point the WritePolicy.timeout value is still correct. Beyond there the client side timeout operates off localtimeGetMillis() so trying to debug in real time doesn’t seem workable.

Here is the relevant code

public boolean writeSet(String date, String appPrefix, String setKey, Set<String> contents) {
  Key key = getAeroSetKey(date, appPrefix, setKey);
  Value[] values = new Value[contents.size()];
  int i=0;
  for(String curVal : contents) {
    values[i++] = Value.get(curVal);
  }
  try {
    LargeSet lset = new LargeSet(client, writePolicy, key, "lset", "small_set");
    lset.destroy();
    log.debug("writePolicy has timeout of " + writePolicy.timeout);
    lset.add(values);
  } catch (RuntimeException e) {
    log.error("WriteSet failed: " + e.getMessage(), e);
    return false;
  }
  return true;
}

Other commands appear to respect the timeout set in the policy

edit 2015/05/15: it appears that the data is in fact getting succesfully added and that the timeout error is in error?

16:16:26.103 [main] INFO j.c.recommend.cli.PartitionTask - Writing partition consisting of 18889 16:16:43.471 [main] ERROR j.c.r.cli.AerospikeOperations - WriteSet failed: Error Code 9: Timeout com.aerospike.client.AerospikeException: Error Code 9: Timeout at com.aerospike.client.command.ReadCommand.parseResult(ReadCommand.java:108) at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:56)

Upon checking the db, the data is actually there:

aql> execute lset.size('lset') on guildmatch.tenkuri20150515 where pk='guild-partitions-1-3'
+-------+
| size  |
+-------+
| 18889 |
+-------+

#2

What timeout did you set?


#3

Several of the timeouts are set, however for this case the relevant WritePolicy.timeout is set, and we have verified with the debugger that at SyncCommand.execute that it is correctly set to be 60seconds(very extreme I am aware, these are moderately large batch inserts into an LDT map).

Is there anything on the server side that might overwrite this rather extreme time out setting?

I said earlier that one with the sets even though it timed out, the data had actually been written. This appears to be somewhat arbitrary as with LargeMap.put() we have the same problem(of timing out), and once it does time out, the data hasn’t been inserted


#4

There is a server service level config called “transaction-max-ms” that defaults to 1000 (1 second). I suspect your 60 second timeout is being ignored and replaced by 1 second. You can increase “transaction-max-ms” to obtain higher timeouts.


#5

I will try that and get back to here.

One interesting thing I found:

this problem only appears to occur in distributed clusters. I was somewhat confused when it turned up as during development I had not run into it.

I created a simple test case and ran it against both a single node cluster and a multi-node cluster.

This is the result in a single node cluster:

19:02:19.614 [main] INFO j.c.r.cli.AerospikeOperationsTest - Took 2516 milis to store full map

This is the result against a simple 2-node test cluster created to verify the issue:

com.aerospike.client.AerospikeException: Error Code 9: Timeout
at com.aerospike.client.command.ReadCommand.parseResult(ReadCommand.java:108)
at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:56)
at com.aerospike.client.AerospikeClient.execute(AerospikeClient.java:1033)
at com.aerospike.client.large.LargeMap.put(LargeMap.java:72)
at -redacted-.AerospikeOperations.storeLargeMap(AerospikeOperations.java:697)
at -redacted-.AerospikeOperationsTest.testLargeMap(AerospikeOperationsTest.java:228)

And for comparison sake, with the single node, but writeTimeout set to a short time that will result in a genuine timeout:

com.aerospike.client.AerospikeException$Timeout: Client timeout: timeout=1000 iterations=1 failedNodes=0 failedConns=0
at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:131)
at com.aerospike.client.AerospikeClient.execute(AerospikeClient.java:1033)
at com.aerospike.client.large.LargeMap.put(LargeMap.java:72)
at -redacted-.AerospikeOperations.storeLargeMap(AerospikeOperations.java:697)


#6

Changing transaction-max-ms did indeed fix this issue.

I followed the server source around a bit and found the standard path but couldn’t identify where the tr->end_time would get overwritten, though I did notice the use of end_time = start_time + the transaction_max_ms

I suppose somewhere in coordinating the cluster that the timeout is lost and the default is used instead since the single node used it correctly…


#7

The server is known to use its own internal timeouts (usually 1 second) when communicating with other server nodes to complete the transaction. One such example is when requested data does not exist on the server and a proxy is made to the server that really contains the data.