Do you have a stateful firewall between any of your clients? There is a bug that 3.12 addresses that causes these symptoms in conjunction with a stateful firewall.
If it is a hotkey issue, you should see high traffic/latency/queuing on one node versus the others.
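As a rough illustration of the per-node comparison described above, here is a minimal sketch. It assumes you have already collected one metric per node (connections, latency, or queue depth) by whatever means you use; the function name `find_hot_nodes`, the `ratio` threshold, and the sample numbers are hypothetical and not part of any Aerospike tool.

```python
from statistics import median


def find_hot_nodes(metric_by_node, ratio=2.0):
    """Return the nodes whose metric exceeds `ratio` times the cluster median.

    metric_by_node: dict mapping a node name to one numeric reading
    (for example client connections or queued transactions).
    The default ratio of 2.0 is an arbitrary starting point, not a tuned value.
    """
    if not metric_by_node:
        return []
    baseline = median(metric_by_node.values())
    if baseline <= 0:
        # With a zero median, any node showing activity at all stands out.
        return sorted(n for n, v in metric_by_node.items() if v > 0)
    return sorted(n for n, v in metric_by_node.items() if v > ratio * baseline)


# Hypothetical readings: client connections per node.
connections = {"node1": 3100, "node2": 2900, "node3": 10200, "node4": 3000}
print(find_hot_nodes(connections))
```

If the same node stands out for traffic, latency, and queuing at once, that is consistent with a hot key; if the imbalance shows up only in connection counts, the cause is more likely on the client or network side.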
Can you share the nature of the performance issue/errors you mentioned?
Clients and servers are in the same zone, sharing the same network.
The cluster upgrade is still in progress, but I already see non-uniform balancing of clients across nodes.
The latency is higher on nodes which have a high number of connections, but the difference is not huge.
If this is a hot key issue, is there anything I can do on the Aerospike side?
Do you think that 10k clients per node is a lot? To put it another way, should I resize the cluster to have more, possibly smaller, nodes? Currently each node has 32 CPUs / 64 GB of RAM.
Changing the shape of the cluster seems to fix the issue: 15 small servers instead of 7 big ones, with a constant overall footprint. I still have errors, but fewer.
If you have any hints as to why the traffic seems to be badly distributed, let me know.
Latency is normal (fewer than 5% of operations take more than 1 ms, for every operation type on every node).
For queue depth, I’m not sure what you mean. Rw_in_progress can be spiky (see attached graph), but errors are still localized on two or three nodes out of 15, which can be explained by hot keys.
I do not have any errors in the server log (or I did not find them). The errors are timeouts (1 s). Currently they occur mainly on one of the 15 servers.
Do you know if I can increase the logging level on a specific component to debug?