Client distribution not uniform

Foo34 · May 29, 2017, 2:40am

Hello,

My clients do not seem to be well balanced across my cluster (see AMC screenshot attached). I’m using the Go client. Data seems to be balanced.

I have a performance issue (errors, probably because there are too many reads/writes in progress).

Is there any way to diagnose what the issue is? To find the hot keys, if that is the problem? How can I improve performance?

Thx

Albot · May 29, 2017, 2:53am

Do you have a stateful firewall between any of your clients? There is a bug, addressed in 3.12, that causes these symptoms in conjunction with a stateful firewall. If it is a hot-key issue, you should see high traffic/latency/queuing on one node versus the others. Can you share the nature of the performance issue/errors you mentioned?
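To make the "one node versus the others" comparison concrete, here is a minimal Go sketch that flags nodes whose connection count is well above the cluster median. It assumes you have already collected a per-node count by some means (for example, a connection statistic such as client_connections read from each node); the `SkewedNodes` function, the `NodeSkew` type, and the threshold are hypothetical names chosen for illustration, not part of any Aerospike API.

```go
package main

import "sort"

// NodeSkew describes one node whose connection count stands out.
// (Hypothetical type, for illustration only.)
type NodeSkew struct {
	Node  string
	Conns int
	Ratio float64 // Conns divided by the cluster median
}

// SkewedNodes returns the nodes whose connection count is greater than
// threshold times the cluster median, sorted by descending ratio.
// The input map (node name -> connection count) is assumed to have been
// gathered elsewhere.
func SkewedNodes(conns map[string]int, threshold float64) []NodeSkew {
	if len(conns) == 0 {
		return nil
	}
	counts := make([]int, 0, len(conns))
	for _, c := range conns {
		counts = append(counts, c)
	}
	sort.Ints(counts)

	// Median of the sorted counts (mean of the two middle values if even).
	mid := len(counts) / 2
	median := float64(counts[mid])
	if len(counts)%2 == 0 {
		median = float64(counts[mid-1]+counts[mid]) / 2
	}
	if median == 0 {
		return nil
	}

	var out []NodeSkew
	for node, c := range conns {
		ratio := float64(c) / median
		if ratio > threshold {
			out = append(out, NodeSkew{Node: node, Conns: c, Ratio: ratio})
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Ratio != out[j].Ratio {
			return out[i].Ratio > out[j].Ratio
		}
		return out[i].Node < out[j].Node
	})
	return out
}
```

Calling `SkewedNodes(counts, 1.5)` with made-up counts where one node holds roughly twice the connections of the others would return just that node. The median is used rather than the mean so that a single overloaded node does not hide itself by pulling the average up.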

Foo34 · May 29, 2017, 3:01am

I’m on GCP, I do not have anything specific between nodes.

The problem I have: write error spikes. They are not huge, but they often occur during a peak of r/w in progress on the node which has a lot of connections (.39):

I will try an upgrade to 3.12.1. Let me know if you have other ideas.

Albot · May 29, 2017, 4:34am

Not between the nodes, between the cluster and the clients.

Foo34 · May 29, 2017, 6:57pm

Clients and servers are in the same zone, sharing the same network.

The cluster upgrade is still in progress, but clients are already unevenly balanced. The latency is higher on nodes which have a high number of connections, but the difference is not huge.

If this is a hot-key issue, is there anything to do on the Aerospike side?

Do you think that 10k clients for a node is a lot? To put it another way, should I resize the cluster to have more nodes, maybe more but smaller nodes? Currently nodes are 32 CPU / 64 GB of RAM.

Thx.

Foo34 · May 31, 2017, 2:24am

Changing the shape of the cluster seems to fix the issue: 15 small servers instead of 7 big ones, with a constant overall footprint. I still have errors, but fewer.

If you have any hint why the traffic seems to be badly distributed, let me know.

Albot · May 31, 2017, 2:32am

Well, I did mention you could check your histograms and your queue depths historically…

Foo34 · May 31, 2017, 2:56am

Latency is normal (fewer than 5% of operations over 1 ms, for every operation on every node).

For queue depth, I’m not sure what you mean. rw_in_progress can be spiky (see attached graph), but errors are still localized on two or three nodes out of 15, which could be explained by hot keys.
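One way to test the hot-key hypothesis from the client side is to count how often each key is requested before the operation is sent. The sketch below is a plain Go counter with no dependency on the Aerospike client; `HotKeyCounter`, `Observe`, and `Top` are hypothetical names, and the unbounded map is an assumption that only suits short sampling windows.

```go
package main

import (
	"sort"
	"sync"
)

// KeyCount pairs a key with the number of times it was observed.
type KeyCount struct {
	Key   string
	Count int
}

// HotKeyCounter is a goroutine-safe frequency counter for keys.
// (Hypothetical helper; memory grows with the number of distinct keys,
// so sample or reset it periodically in a real service.)
type HotKeyCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

// NewHotKeyCounter returns an empty counter.
func NewHotKeyCounter() *HotKeyCounter {
	return &HotKeyCounter{counts: make(map[string]int)}
}

// Observe records one access to key. Call it just before issuing
// the read or write for that key.
func (h *HotKeyCounter) Observe(key string) {
	h.mu.Lock()
	h.counts[key]++
	h.mu.Unlock()
}

// Top returns up to n keys with the highest counts, most frequent first.
// Ties are broken by key so the result is deterministic.
func (h *HotKeyCounter) Top(n int) []KeyCount {
	h.mu.Lock()
	out := make([]KeyCount, 0, len(h.counts))
	for k, c := range h.counts {
		out = append(out, KeyCount{Key: k, Count: c})
	}
	h.mu.Unlock()

	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Key < out[j].Key
	})
	if n < 0 {
		n = 0
	}
	if n < len(out) {
		out = out[:n]
	}
	return out
}
```

If a handful of keys dominate the output of `Top` during an error spike, that supports the hot-key explanation; if the counts are flat, the imbalance more likely comes from connection distribution than from the workload.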

Albot · May 31, 2017, 11:28pm

Can you post some logs and let us know the timestamp at which you experienced the issue, so we can cross-reference them? The server log messages are documented at http://www.aerospike.com/docs/reference/serverlogmessages/

Foo34 · June 8, 2017, 1:50pm

I do not have any errors in the server log (or I did not find them). The errors are timeouts (1s). Currently they occur mainly on one of the 15 servers. Do you know if I can increase the logging level on a specific component to debug?
