We have a fairly loaded
replication-factor 3 cluster (currently 2 nodes alive),
C-188.8.131.52, on bare-metal servers with SSDs. There are multiple sets with different access patterns. One of the sets is just ~1k records of simple counters. Regularly, a call like
`aerospike.Client.Operate(nil, k, as.AddOp(incrBin), as.GetOp())` on this set returns a ‘hot key’ /
KEY_BUSY (#14) error when we have single-record access spikes.
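For reference, here is that call expanded into a minimal sketch of how one might detect KEY_BUSY and retry with jittered backoff instead of surfacing the error. It assumes the v4-era `aerospike-client-go` error model (`types.AerospikeError` with `ResultCode()`); the retry count and backoff values are illustrative, not what we actually run:

```go
package counters

import (
	"math/rand"
	"time"

	as "github.com/aerospike/aerospike-client-go"
	ast "github.com/aerospike/aerospike-client-go/types"
)

// incrementAndGet bumps the counter bin and reads the record back in a single
// Operate call. On KEY_BUSY (#14) it backs off and retries instead of failing;
// maxRetries and the backoff base are illustrative values.
func incrementAndGet(client *as.Client, key *as.Key, maxRetries int) (*as.Record, error) {
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		rec, err := client.Operate(nil, key,
			as.AddOp(as.NewBin("incrBin", 1)), // in-place counter increment
			as.GetOp(),                        // read the updated record back
		)
		if err == nil {
			return rec, nil
		}
		// KEY_BUSY means the pending-transaction queue for this key is full.
		if aerr, ok := err.(ast.AerospikeError); ok && aerr.ResultCode() == ast.KEY_BUSY {
			lastErr = err
			// Jittered backoff so concurrent retries don't re-spike the key.
			time.Sleep(time.Duration(rand.Intn(2<<uint(attempt))+1) * time.Millisecond)
			continue
		}
		return nil, err // any other error is surfaced as-is
	}
	return nil, lastErr
}
```

Backoff only spreads a spike out, though; it doesn’t reduce the per-key transaction pressure, hence the questions below.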
I’ve got the impression (including from here) that
`storage-engine memory` namespaces don’t use transaction queues, so we tried to move this set to a
memory namespace, but it didn’t help: we still get the error.
- why does this happen with a memory namespace? How can we analyze it, and are there any metrics?
- is it implemented differently in newer versions? Could an upgrade possibly help? (Hard to upgrade without downtime with only 2 of the 3 nodes alive.)
- what are the consequences of raising `transaction-pending-limit` too high? Global latency during hot-key spikes? Sometimes it seems that hot-key spikes cause Aerospike to miss so many heartbeats that it begins to rebalance (and during migrations any hiccup can trigger another rebalance, so it can cycle for days). Could it be that we shouldn’t raise `transaction-pending-limit` (currently 40) any further and should instead redesign all the counters-data logic completely? (One possible redesign is sketched after this list.)
- again, let’s assume that hot-key error spikes (correlated with `fail_key_busy` metric spikes) freeze Aerospike completely without saturating machine resources; how do we analyze that? If there is some group of worker threads that handles everything at the same time and can be saturated by the workload, are there any metrics, benchmark histograms, or toggleable logging that could help analyze it? (A stats-polling sketch is included after this list too.)
- (maybe not related to this problem, but for example) `tsvc_queue` is gone in newer versions; what should we use instead? We once had a (likely) real disk-overload problem (probably unrelated to hot keys, and seemingly gone after setting `read-page-cache true`) that was correlated with `tsvc_queue`. Without it, it seems even harder to understand what’s going on when performance degrades wildly (with no signs of machine-resource saturation), especially when it happens during migrations.
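Regarding the redesign option from the `transaction-pending-limit` question: one way we could restructure the counters would be to coalesce increments client-side, so a spike of N increments to one key becomes a single summed `AddOp` per flush interval. A minimal sketch using the same Go client; the namespace/set, bin name, and interval are made up, and deltas buffered since the last flush would be lost if the process dies:

```go
package counters

import (
	"log"
	"sync"
	"time"

	as "github.com/aerospike/aerospike-client-go"
)

// counterBuffer accumulates increments in memory and flushes one summed AddOp
// per key per interval, turning N concurrent transactions on a hot key into
// one write per flush.
type counterBuffer struct {
	mu     sync.Mutex
	deltas map[string]int64 // user key -> pending delta
}

func newCounterBuffer() *counterBuffer {
	return &counterBuffer{deltas: make(map[string]int64)}
}

// Incr is what request handlers would call instead of client.Operate.
func (b *counterBuffer) Incr(userKey string, n int64) {
	b.mu.Lock()
	b.deltas[userKey] += n
	b.mu.Unlock()
}

// flushLoop periodically writes the accumulated deltas to Aerospike.
func (b *counterBuffer) flushLoop(client *as.Client, interval time.Duration) {
	for range time.Tick(interval) {
		b.mu.Lock()
		pending := b.deltas
		b.deltas = make(map[string]int64)
		b.mu.Unlock()

		for userKey, delta := range pending {
			key, err := as.NewKey("test", "counters", userKey) // illustrative ns/set
			if err != nil {
				log.Printf("bad key %q: %v", userKey, err)
				continue
			}
			// One AddOp carries the whole coalesced delta for this key.
			if _, err := client.Operate(nil, key, as.AddOp(as.NewBin("incrBin", delta))); err != nil {
				log.Printf("flush %q: %v", userKey, err) // could re-add delta on failure
			}
		}
	}
}
```

Is this the direction people usually take here, or is there something server-side we’re missing?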
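And on the analysis side, the counters in question can at least be scraped per node over the info protocol. A sketch assuming the Go client’s `RequestNodeInfo` helper; on our version `fail_key_busy` comes from the namespace context and `tsvc_queue` from the service statistics (the namespace name is illustrative):

```go
package counters

import (
	"fmt"
	"log"
	"strings"
	"time"

	as "github.com/aerospike/aerospike-client-go"
)

// pollHotKeyStats prints per-node values that correlate with hot-key trouble:
// fail_key_busy (namespace context) and tsvc_queue (service statistics, on
// versions that still report it).
func pollHotKeyStats(client *as.Client, ns string, interval time.Duration) {
	cmds := []string{"statistics", "namespace/" + ns}
	for range time.Tick(interval) {
		for _, node := range client.GetNodes() {
			info, err := as.RequestNodeInfo(node, cmds...)
			if err != nil {
				log.Printf("%s: info failed: %v", node.GetName(), err)
				continue
			}
			svc := parseInfo(info["statistics"])
			nsStats := parseInfo(info["namespace/"+ns])
			fmt.Printf("%s tsvc_queue=%s fail_key_busy=%s\n",
				node.GetName(), svc["tsvc_queue"], nsStats["fail_key_busy"])
		}
	}
}

// parseInfo splits a "k1=v1;k2=v2;..." info payload into a map.
func parseInfo(s string) map[string]string {
	m := make(map[string]string)
	for _, pair := range strings.Split(s, ";") {
		if i := strings.IndexByte(pair, '='); i > 0 {
			m[pair[:i]] = pair[i+1:]
		}
	}
	return m
}
```

But graphing these only confirms the spikes after the fact; it doesn’t tell us which internal thread pool is the bottleneck, which is really what the question above is about.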
Sorry if it’s quite messy. Thanks.