Client distribution not uniform


#1

Hello,

My clients do not seem to be well balanced across my cluster (see AMC screenshot attached). I’m using the Go client. Data seems to be balanced.

I have performance issues (errors, probably because too many reads/writes are in progress).

Is there a way to diagnose the issue? To find the hot keys, if that is the problem? How can I improve performance?

Thx


#2

Do you have a stateful firewall between any of your clients? There is a bug that 3.12 addresses that causes these symptoms in conjunction with a stateful firewall. If it is a hot key issue, you should see high traffic/latency/queuing on one node versus the others. Can you share the nature of the performance issues/errors you mentioned?
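To make the "one node versus the others" comparison concrete, here is a minimal sketch in Go. It assumes you have already collected one number per node (connections, latency, or queue depth, for example read off AMC) and flags nodes that sit well above the cluster median. The `NodeStat` type, the `FlagOutliers` function, and the threshold factor are all hypothetical names for illustration; nothing here calls the Aerospike client or server.

```go
package main

import "sort"

// NodeStat holds one sampled metric for one node.
// Hypothetical type: the value could be a connection count,
// a latency figure, or a queue depth collected by hand.
type NodeStat struct {
	Node  string
	Value float64
}

// median returns the median of vals without modifying the caller's slice.
func median(vals []float64) float64 {
	if len(vals) == 0 {
		return 0
	}
	s := append([]float64(nil), vals...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// FlagOutliers returns the names of the nodes whose value exceeds
// factor times the cluster median, in input order.
func FlagOutliers(stats []NodeStat, factor float64) []string {
	vals := make([]float64, len(stats))
	for i, s := range stats {
		vals[i] = s.Value
	}
	m := median(vals)
	var hot []string
	for _, s := range stats {
		if s.Value > factor*m {
			hot = append(hot, s.Node)
		}
	}
	return hot
}
```

The median is used rather than the mean so that a single overloaded node does not drag the baseline up with it. A factor of 2 is only a starting guess; pick whatever separates your normal spread from the node you suspect.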


#3

I’m on GCP, and I do not have anything specific between nodes.

Problems I have: write error spikes. They are not huge, but they often occur during a peak of r/w in progress on the node which has a lot of connections (.39).

I will try an upgrade to 3.12.1. Let me know if you have other ideas.


#4

Not between the nodes, between the cluster and the clients.


#5

Clients and servers are in the same zone, sharing the same network.

The cluster upgrade is still in progress, but I already see non-uniform balancing of clients. The latency is higher on nodes which have a high number of connections, but the difference is not huge.

If this is a hot key issue, is there anything to do on the Aerospike side?

Do you think that 10k clients for a node is huge? To put it another way, should I resize the cluster to have more nodes, maybe more but smaller nodes? Currently nodes are 32 CPUs / 64 GB of RAM.

Thx.


#6

Changing the shape of the cluster seems to fix the issue: 15 small servers instead of 7 big ones, with a constant overall footprint. I still have errors, but fewer.

If you have any hint as to why the traffic seems to be badly distributed, let me know.


#7

Well, I did mention you could check your histograms and your queue depths historically…


#8

Latency is normal (<5% over 1 ms for every operation on every node).

For queue depth, I’m not sure what you mean. rw_in_progress can be spiky (see attached graph), but errors are still localized on two or three of the 15 nodes, which could be explained by hot keys.
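One way to test the hot key hypothesis from the client side is to count how often each key is read or written and look at the top few. Below is a minimal sketch in Go, assuming you can call a small helper next to each read/write in your application. `KeyCounter`, `Record`, and `Top` are hypothetical names for illustration and are not part of the Aerospike Go client.

```go
package main

import (
	"sort"
	"sync"
)

// KeyCounter counts how often each key is touched.
// Hypothetical helper, not part of any Aerospike library.
type KeyCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

// NewKeyCounter returns an empty counter ready for use.
func NewKeyCounter() *KeyCounter {
	return &KeyCounter{counts: make(map[string]int)}
}

// Record notes one read or write of key. Safe for concurrent use.
func (c *KeyCounter) Record(key string) {
	c.mu.Lock()
	c.counts[key]++
	c.mu.Unlock()
}

// KeyCount pairs a key with the number of times it was recorded.
type KeyCount struct {
	Key   string
	Count int
}

// Top returns up to n of the most frequently recorded keys,
// most frequent first; ties are broken by key for stable output.
func (c *KeyCounter) Top(n int) []KeyCount {
	if n < 0 {
		n = 0
	}
	c.mu.Lock()
	out := make([]KeyCount, 0, len(c.counts))
	for k, v := range c.counts {
		out = append(out, KeyCount{Key: k, Count: v})
	}
	c.mu.Unlock()
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Key < out[j].Key
	})
	if n < len(out) {
		out = out[:n]
	}
	return out
}
```

Call `Record(key)` wherever the application issues a read or write, then log `Top(10)` periodically. The map grows with the number of distinct keys, so on a large key space it is probably better to record only a sample of operations (say one in a hundred) or to reset the counter at each reporting interval. If a handful of keys dominate the counts, that supports the hot key explanation; if the counts are flat, the imbalance likely comes from somewhere else.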


#9

Can you post some logs and let us know the timestamp at which you experienced the issue, so we can cross-reference the logs? The server log message reference is here: http://www.aerospike.com/docs/reference/serverlogmessages/


#10

I do not have any errors in the server log (or I did not find them). The errors are timeouts (1s). Currently they occur mainly on one of the 15 servers. Do you know if I can increase the logging level on a specific component to debug?