Client distribution not uniform

Hello,

My clients do not seem to be well balanced across my cluster (see AMC screenshot attached). I’m using the Go client. Data seems to be balanced.

I have a performance issue (errors, probably because too many reads/writes are in progress).

Is there any way to diagnose the issue? Can I find the hot keys, if that is the problem? How can I improve performance?

Thx

Do you have a stateful firewall between any of your clients? There is a bug, addressed in 3.12, that causes these symptoms in conjunction with a stateful firewall. If it is a hot key issue, you should see high traffic/latency/queuing on one node versus the others. Can you share the nature of the performance issue/errors you mentioned?
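To make the "one node versus the others" comparison concrete, here is a minimal Go sketch that flags nodes whose metric (client connections, reads/writes in progress, error count) is well above the cluster median. It does not call Aerospike at all: the node addresses, the numbers, the 1.5 factor and the names `median` and `skewedNodes` are all assumptions for illustration, with the values entered by hand from whatever monitoring view you use.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the median of vals without modifying the input slice.
func median(vals []float64) float64 {
	if len(vals) == 0 {
		return 0
	}
	s := append([]float64(nil), vals...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// skewedNodes returns, in sorted order, the nodes whose metric is
// strictly greater than factor times the cluster median.
func skewedNodes(metric map[string]float64, factor float64) []string {
	vals := make([]float64, 0, len(metric))
	for _, v := range metric {
		vals = append(vals, v)
	}
	limit := median(vals) * factor

	var out []string
	for node, v := range metric {
		if v > limit {
			out = append(out, node)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical per-node client connection counts (made-up values).
	conns := map[string]float64{
		"10.0.0.33": 4100,
		"10.0.0.34": 3900,
		"10.0.0.35": 4300,
		"10.0.0.36": 4000,
		"10.0.0.37": 4200,
		"10.0.0.38": 3800,
		"10.0.0.39": 10200,
	}
	fmt.Println(skewedNodes(conns, 1.5)) // prints [10.0.0.39]
}
```

If the same node stands out for connections, in-progress transactions and errors at the same time, that points towards a hot key or a client-side routing problem rather than a general capacity problem.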

I’m on GCP, I do not have anything specific between nodes.

The problem I have: write error spikes. They are not huge, but they often occur during a peak of reads/writes in progress on the node which has a lot of connections (.39).

I will try an upgrade to 3.12.1. Let me know if you have other ideas.

Not between the nodes, between the cluster and the clients.

Clients and servers are in the same zone, sharing the same network.

The cluster upgrade is still in progress, but I already see non-uniform balancing of clients. The latency is higher on nodes which have a high number of connections, but the difference is not huge.

If this is a hot key issue, is there anything I can do on the Aerospike side?

Do you think that 10k clients for a node is huge? To put it another way, should I resize the cluster to have more nodes, maybe more, smaller nodes? Currently the nodes have 32 CPUs / 64 GB of RAM.

Thx.

Changing the shape of the cluster seems to fix the issue: 15 small servers instead of 7 big ones, with the same overall footprint. I still have errors, but fewer.

If you have any hint as to why the traffic seems to be badly distributed, let me know.

Well, I did mention you could check your histograms and your queue depths historically…

Latency is normal (<5% of operations over 1 ms, for every operation on every node).

For queue depth, I’m not sure what you mean. Rw_in_progress can be spiky (see attached graph), but errors are still localized on two or three nodes out of 15, which can be explained by hot keys.
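One way to test the hot key theory from the application side is to tally key accesses in the client process and look at the top entries. The following is only a sketch under that assumption: `KeyCounter`, `Record` and `Top` are hypothetical names, not part of the Aerospike Go client, and `Record` would have to be called by hand next to each read or write.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// KeyCounter tallies how often each key is accessed.
// It is safe for concurrent use by multiple goroutines.
type KeyCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

// NewKeyCounter returns an empty counter.
func NewKeyCounter() *KeyCounter {
	return &KeyCounter{counts: make(map[string]int)}
}

// Record notes one access to key.
func (c *KeyCounter) Record(key string) {
	c.mu.Lock()
	c.counts[key]++
	c.mu.Unlock()
}

// KeyCount pairs a key with its access count.
type KeyCount struct {
	Key   string
	Count int
}

// Top returns up to n keys with the highest counts,
// breaking ties by key name so the order is deterministic.
func (c *KeyCounter) Top(n int) []KeyCount {
	c.mu.Lock()
	out := make([]KeyCount, 0, len(c.counts))
	for k, v := range c.counts {
		out = append(out, KeyCount{Key: k, Count: v})
	}
	c.mu.Unlock()

	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Key < out[j].Key
	})
	if n < 0 {
		n = 0
	}
	if n < len(out) {
		out = out[:n]
	}
	return out
}

func main() {
	kc := NewKeyCounter()
	// Made-up keys standing in for the user keys passed to the client.
	for _, k := range []string{"user:1", "user:2", "user:1", "user:3", "user:1"} {
		kc.Record(k)
	}
	for _, e := range kc.Top(2) {
		fmt.Printf("%s %d\n", e.Key, e.Count)
	}
}
```

The map grows with the number of distinct keys, so on a real workload it would make sense to record only a sample of requests, or only the requests that time out, and to reset the counter periodically.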

Can you post some logs and let us know the timestamp at which you experienced the issue, so we can cross-reference the logs? See http://www.aerospike.com/docs/reference/serverlogmessages/

I do not have any errors in the server log (or I did not find them). The errors are timeouts (1s). Currently they occur mainly on one of the 15 servers. Do you know if I can increase the logging level for a specific component to debug?