Aerospike Java client shows high read times when an EC2 instance is created

Hello, I am debugging a scenario where we see a high number of Aerospike Java client timeout exceptions.
The client is a Spring Boot REST web service running on EC2 instances (EBS). The exceptions are more frequent around the time an EC2 instance is added to the cluster. Read latency is very high during that period, so more reads exceed the timeout threshold and throw this exception.

I wanted to check in this forum if the following are possible causes:

  1. Does the Java client have a cache warming phase? I think this is unlikely, but wanted to check.

  2. The REST service was writing and reading the data. To reduce the load, we have moved the write operation to a Spark job on AWS EMR that writes to Aerospike. I started seeing this issue after moving the write operations to this EMR Spark cluster. Could read latency be affected if a large dataset is added to Aerospike outside the Java client?

Please give any suggestions to tackle this. The namespace configuration is:

namespace t1 {
        replication-factor 2
        memory-size 25G
        high-water-memory-pct 70
        high-water-disk-pct 60
        default-ttl 4d 
        single-bin true
        partition-tree-sprigs 4096
        storage-engine memory
}

Here is the full stacktrace.

com.aerospike.client.AerospikeException$Timeout: Client timeout: timeout=30 iterations=1 lastNode=BB90B6F2699290E 3000
at com.aerospike.client.async.NettyCommand.totalTimeout(
at com.aerospike.client.async.NettyCommand.timeout(
at com.aerospike.client.async.HashedWheelTimer$HashedWheelTimeout.expire(
at com.aerospike.client.async.HashedWheelTimer$HashedWheelTimeout.access$700(
at com.aerospike.client.async.HashedWheelTimer$HashedWheelBucket.expireTimeouts(
at io.netty.util.concurrent.PromiseTask$
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(
at io.netty.util.concurrent.SingleThreadEventExecutor$
at io.netty.util.concurrent.DefaultThreadFactory$

Thanks for reading.

Here is some feedback on those questions:

1- The Java client does not have a warming phase, so your guess is correct. That said, when a node is added to a cluster, the first few transactions to that node require new connections to be created, and that setup time adds to the total time taken to process the transaction. Recent versions of the Java client have a separate connect timeout policy parameter, so that this extra time can be excluded from the total timeout.
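To make the point about connection setup concrete, here is a minimal sketch of the timeout arithmetic. It does not use the Aerospike client at all; the class and method names are made up for illustration, the 30 is taken from timeout=30 in the stack trace (assumed to be milliseconds), and the 40 ms connect time and 2 ms read time are hypothetical numbers, not measurements.

```java
// Hypothetical helper, not part of the Aerospike client API.
final class TimeoutBudget {
    private TimeoutBudget() {}

    // Connection setup counts against the same total timeout as the read.
    static boolean fitsSharedBudget(long totalTimeoutMs, long connectMs, long readMs) {
        return connectMs + readMs <= totalTimeoutMs;
    }

    // Connection setup has its own limit; only the read counts against the total timeout.
    static boolean fitsSeparateBudgets(long connectTimeoutMs, long totalTimeoutMs,
                                       long connectMs, long readMs) {
        return connectMs <= connectTimeoutMs && readMs <= totalTimeoutMs;
    }

    public static void main(String[] args) {
        long totalTimeoutMs = 30;  // from "timeout=30" in the stack trace (assumed ms)
        long connectMs = 40;       // hypothetical cost of opening a connection to a new node
        long readMs = 2;           // hypothetical read latency on a warm connection

        System.out.println("shared budget ok:   "
                + fitsSharedBudget(totalTimeoutMs, connectMs, readMs));
        System.out.println("separate budget ok: "
                + fitsSeparateBudgets(1000, totalTimeoutMs, connectMs, readMs));
    }
}
```

With these assumed numbers the first request to a new node fails when setup and read share the 30 ms budget, and succeeds once connection setup is given its own (here 1000 ms, also an assumption) limit.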

2- That is a pretty general question. Looking at the server logs (specifically the benchmark histograms for reads and writes) could provide some answers. In theory, though, a heavier write workload, or simply one with a different shape (a different number of connections, different record sizes, etc.), could slow down the server nodes and potentially impact read latencies.


Thanks for the responses.

The high number of Aerospike timeouts was linked to higher JVM memory settings. The EC2 instance type is m5a.2xlarge (8 vCPU, 32 GB RAM). The JVM memory allocation was -Xms16g -Xmx16g -XX:MaxPermSize=1g. When this was increased to -Xms23g -Xmx23g -XX:MaxPermSize=1g, the number of AerospikeException$Timeout errors went up.

I am not yet sure why providing more RAM to the JVM has caused these errors. Any notes/observations by anyone who has seen a similar problem will be helpful.
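
One way to start narrowing this down is to check how much of the instance's RAM is left outside the JVM at each setting, and how much time the JVM spends in garbage collection. The sketch below uses only the standard java.lang.management API; the class name is made up, the 1 GB "other" figure just mirrors the MaxPermSize flag, and real non-heap usage (thread stacks, direct buffers) is unknown here. It is a diagnostic aid under those assumptions, not an explanation of the timeouts.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical diagnostic helper; names are illustrative only.
final class GcPauseProbe {
    private GcPauseProbe() {}

    // Approximate accumulated collection time across all collectors, in milliseconds.
    // Collectors that report -1 (undefined) are skipped.
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // RAM left for the OS and everything else once heap and other JVM memory are set aside.
    static long headroomGb(long instanceRamGb, long heapGb, long otherJvmGb) {
        return instanceRamGb - heapGb - otherJvmGb;
    }

    public static void main(String[] args) throws InterruptedException {
        // 32 GB instance, heap sizes from the post, 1 GB assumed for other JVM memory.
        System.out.println("headroom at 16g heap: " + headroomGb(32, 16, 1) + " GB");
        System.out.println("headroom at 23g heap: " + headroomGb(32, 23, 1) + " GB");

        // Sample GC time over one second; in a real service this would run periodically.
        long before = totalGcTimeMs();
        Thread.sleep(1000);
        long delta = totalGcTimeMs() - before;
        System.out.println("GC time in the last second: " + delta + " ms");
    }
}
```

If the sampled GC time regularly comes close to or exceeds the 30 ms client timeout shown in the stack trace, that would be worth comparing between the 16g and 23g settings. Note that for concurrent collectors the reported collection time is not the same as stop-the-world pause time, so GC logs would give a more precise picture.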
