Reads served only from some nodes while using eventloop

I was trying to compare synchronous and asynchronous operations with Aerospike’s Java client. The Aerospike cluster had 9 nodes.

I noticed that with synchronous operations, each server node served an equal number of reads, whereas with asynchronous operations (using the event loop), only 2 nodes served all the requests while the other 7 served none. The replication factor is 2.

I kept the same client policy in both cases, except for the settings related to the event loop size and maxConnsPerNode. Can someone point out a possible reason why this happens?

The behavior you describe is very peculiar and should not happen. “Only 2 nodes were serving all the requests while the other 7 served none” - this looks as if the two nodes are proxying to the other nodes, as described here. However, the sync and async clients should obviously not differ in this manner. Can you describe the request policy details for the two, and also share the output of ‘show latency’ (or ‘show latencies’, depending on the server version)? What is the number of cores on the client machine, the number of event loops, and the maximum number of connections?

The client machine has 8 cores, the event loop size is 16, and maxConnsPerNode is 16*100. Latencies are always less than 1 ms.

For the sync client, maxConnsPerNode is left at its default, and the following settings are the same for both the sync and async clients:

clientPolicy.readPolicyDefault.replica = Replica.MASTER_PROLES;
clientPolicy.readPolicyDefault.consistencyLevel = ConsistencyLevel.CONSISTENCY_ONE;
clientPolicy.requestProleReplicas = true;

Just to clarify: with the async client, were all the requests directed to and serviced by just two nodes, or were only two nodes active and responding to the requests they received (with the remaining requests, which would otherwise have been serviced by the other 7 nodes, failing)?

All cluster nodes were active. I even stopped the two nodes which were serving all the reads and then two other nodes started serving all the reads.

This is weird indeed. Are exactly the same keys being accessed in this test, and are all the requests succeeding?

I can see only 2 ways this can happen if all the transactions are successful and the throughput is similar between the 2 tests:

  • Somehow the data set for the async test is limited to some specific partitions? For example, by restoring only 1 or 2 files from a backup (which is partition based). Those partitions would then be held by 2 nodes, and shutting down those 2 nodes would cause those partitions to move.

  • Somehow, for the async tests, only 2 nodes are seen by the client and they end up proxying the transactions. But that is hard to imagine happening with recent clients, which should throw a -3 error (node not found for partition).
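
To make the first possibility concrete, here is a minimal, self-contained Java sketch. It does not use the Aerospike client and does not reproduce Aerospike's real partition placement; the rule in nodeFor (replica r of partition p lives on the (p + r)-th live node) is purely an assumption for illustration, as are the class and method names. It only shows why reads confined to a single partition, with replication factor 2 and reads allowed on both master and prole, would land on exactly 2 of 9 nodes, and why stopping those 2 nodes would shift the load to 2 other nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class PartitionSkewSketch {
    static final int PARTITIONS = 4096;      // partitions per Aerospike namespace
    static final int REPLICATION_FACTOR = 2; // as stated in the thread

    // Hypothetical placement rule, NOT Aerospike's real algorithm:
    // replica r of partition p lives on liveNodes[(p + r) % liveNodes.size()].
    static int nodeFor(int partition, int replica, List<Integer> liveNodes) {
        return liveNodes.get((partition + replica) % liveNodes.size());
    }

    // Nodes that can serve reads for the given partitions when a read may be
    // sent to either the master copy or the prole copy.
    static Set<Integer> nodesServing(Collection<Integer> partitions, List<Integer> liveNodes) {
        Set<Integer> nodes = new TreeSet<>();
        for (int p : partitions) {
            for (int r = 0; r < REPLICATION_FACTOR; r++) {
                nodes.add(nodeFor(p, r, liveNodes));
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        List<Integer> allNodes = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8);
        List<Integer> onePartition = Arrays.asList(42); // arbitrary partition id

        // All reads hit one partition: only its two holders serve them.
        Set<Integer> before = nodesServing(onePartition, allNodes);
        System.out.println("nodes serving reads: " + before);

        // Stop those two nodes: the partition is re-homed on two survivors.
        List<Integer> survivors = new ArrayList<>(allNodes);
        survivors.removeAll(before);
        System.out.println("after stopping them: " + nodesServing(onePartition, survivors));
    }
}
```

If the keys were spread over all 4096 partitions, the same function would return all 9 nodes, which is the pattern reported for the sync test.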

How are you checking the transactions on the server side? Can you maybe share the output of show latency or show latencies from asadm for both runs?

Also, for the async run, it would be interesting to capture the following stats at 2 points during the run, so we can see how they evolve: asadm -e "show stat like client_"
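
Comparing the two captures could be done along these lines. This is only a rough sketch: it assumes each capture has been reduced to a semicolon-separated "name=value" string for one node (asadm itself prints a table, so some conversion would be needed first), and the numbers in main are made up. The class and method names are hypothetical. A large jump in a proxy counter relative to the read counter on the two busy nodes would point towards the proxying explanation.

```java
import java.util.Map;
import java.util.TreeMap;

final class ClientStatDelta {
    // Parse "name=value;name=value" and keep only numeric client_* stats.
    static Map<String, Long> parse(String raw) {
        Map<String, Long> stats = new TreeMap<>();
        for (String pair : raw.split(";")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // not a name=value pair
            }
            String name = pair.substring(0, eq).trim();
            if (!name.startsWith("client_")) {
                continue; // mirror the "like client_" filter
            }
            try {
                stats.put(name, Long.parseLong(pair.substring(eq + 1).trim()));
            } catch (NumberFormatException e) {
                // skip non-numeric values
            }
        }
        return stats;
    }

    // Change in each counter between the first and second capture.
    static Map<String, Long> delta(Map<String, Long> first, Map<String, Long> second) {
        Map<String, Long> diff = new TreeMap<>();
        for (Map.Entry<String, Long> e : second.entrySet()) {
            long earlier = first.containsKey(e.getKey()) ? first.get(e.getKey()) : 0L;
            diff.put(e.getKey(), e.getValue() - earlier);
        }
        return diff;
    }

    public static void main(String[] args) {
        // Made-up sample values for a single node, for illustration only.
        String t0 = "client_read_success=1000;client_proxy_complete=10;objects=500";
        String t1 = "client_read_success=4000;client_proxy_complete=2510;objects=500";
        for (Map.Entry<String, Long> e : delta(parse(t0), parse(t1)).entrySet()) {
            System.out.println(e.getKey() + " +" + e.getValue());
        }
    }
}
```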
