Hot key errors even with 'memory' namespace

We have a fairly heavily loaded replication-factor-3 cluster (currently 2 nodes alive), running C-4.5.0.10 on bare-metal servers with SSDs. There are multiple sets with different access patterns. One of the sets holds just 1k records of simple counters. Regularly, for example, aerospike.Client.Operate(nil, k, as.AddOp(incrBin), as.GetOp()) on this set returns a ‘hot key’ / KEY_BUSY (#14) error when we get single-record access spikes.

I’ve got the impression (including from here) that storage-engine memory namespaces don’t use transaction queues, so we tried moving this set to a memory namespace, but it didn’t help - we still get the error.

  1. Why does this happen with a memory namespace? How can we analyze it - are there any metrics?
  2. Is this implemented differently in newer versions - could an upgrade help? (It’s hard for us to upgrade without downtime due to a delete-related incompatibility.)
  3. What are the consequences of raising transaction-pending-limit too high - global latency during hot-key spikes? Sometimes it seems that hot-key spikes cause Aerospike to miss so many heartbeats that it begins to rebalance (and during migrations any hiccup can trigger another rebalance - this can cycle for days). Could it be that we shouldn’t raise transaction-pending-limit (currently 40) any further and should instead redesign the counters-data logic completely?
  4. Again - assume that hot-key error spikes (correlated with fail_key_busy metric spikes) freeze Aerospike completely without saturating machine resources - how do we analyze that? If there is some group of workers that handles everything at the same time and can be saturated by the workload - are there any metrics, benchmark histograms, or toggleable logging that can help analyze it?
  5. (Maybe unrelated to this problem, but for example:) tsvc_queue is gone in newer versions - what should we use instead? We once had a (likely) real disk-overload problem (probably not related to hot keys, and seemingly gone after setting read-page-cache true) that correlated with tsvc_queue; without it, it seems even harder to understand what’s going on when performance degrades wildly (with no sign of machine-resource saturation), especially during migrations.
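Regarding the "redesign the counters" option in question 3: a common client-side mitigation for a hot counter is to shard it across several sub-keys so concurrent increments spread over multiple records (each with its own pending-transaction slot) and reads sum the shards. A minimal Go sketch of the idea - `shardedCounter` and its in-memory map are illustrative stand-ins, not real client API; in practice each shard would be its own Aerospike record key:

```go
package main

import (
	"fmt"
	"math/rand"
)

// shardedCounter splits one logical counter across nShards sub-keys so that
// concurrent increments land on different records instead of one hot key.
// A map stands in for the Aerospike set here.
type shardedCounter struct {
	name    string
	nShards int
	store   map[string]int64 // stand-in for the Aerospike set
}

func newShardedCounter(name string, nShards int) *shardedCounter {
	return &shardedCounter{name: name, nShards: nShards, store: map[string]int64{}}
}

// Add increments one randomly chosen shard (in Aerospike terms: an
// Operate/AddOp against the shard's key instead of the single hot key).
func (c *shardedCounter) Add(delta int64) {
	c.store[fmt.Sprintf("%s:%d", c.name, rand.Intn(c.nShards))] += delta
}

// Total reads every shard and sums them (in Aerospike terms: a batch read
// of the nShards shard keys).
func (c *shardedCounter) Total() int64 {
	var sum int64
	for i := 0; i < c.nShards; i++ {
		sum += c.store[fmt.Sprintf("%s:%d", c.name, i)]
	}
	return sum
}

func main() {
	c := newShardedCounter("pageviews", 8)
	for i := 0; i < 1000; i++ {
		c.Add(1)
	}
	fmt.Println(c.Total()) // sums to 1000 regardless of shard distribution
}
```

The trade-off is that reads become a multi-record batch and the total is no longer updated atomically as a single value.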

Sorry if it’s quite messy. Thanks.


This error is governed by the transaction-pending-limit configuration parameter. By default, it occurs when 20 or more transactions for the same key are pending simultaneously.
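For reference, this parameter lives in the service context of aerospike.conf; the value below is illustrative, not a recommendation:

```
service {
    # default 20; raising it allows deeper per-key queues before KEY_BUSY
    transaction-pending-limit 40
}
```

To my knowledge it is also dynamically settable (via asinfo set-config on the service context), so it can be adjusted without a restart - verify against the configuration reference for your version.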

Aerospike Enterprise supports durable deletes. That said, you should be able to upgrade without downtime if you wipe the storage on each node as you upgrade it.

Typically, hot keys cause a few nodes to handle a much higher workload than the other nodes in the cluster (depending on the nature of the hot key).

The heartbeat and transaction subsystems share only the network hardware, so if the cluster is breaking up under high load, your network isn’t keeping up. When you receive this error, do you immediately retry? If so, your client is compounding the issue. If these transactions simply cannot be allowed to fail, it would likely be best to increase the limit until you are able to redesign and reduce the impact of hot keys.

You could increase your heartbeat timeout configuration to make the cluster more tolerant of these scenarios. But a heartbeat issue like this probably indicates it is time to invest in more nodes or better networking.
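For context, the relevant settings sit in the heartbeat sub-context of the network stanza; the values below are illustrative only (interval is milliseconds between heartbeats, timeout is the number of missed intervals before a node is considered gone - check the defaults for your version):

```
network {
    heartbeat {
        mode mesh
        interval 150   # ms between heartbeats
        timeout 20     # missed intervals before node eviction
    }
}
```

Raising timeout trades faster failure detection for tolerance of transient load spikes.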

Yes, the previous transaction model was arguably a bit easier to debug; the new system processes transactions directly from the kernel’s network stack. We found that we could achieve higher throughput by eliminating those queues, and it is possible this improvement could reduce your hot-key problem.


BTW, more information about key-busy errors and possible mitigations can be found here:
