Hot Key error code 14

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Problem Description

The client API returns error code 14, or the client logs KEY_BUSY result code messages, for example:

ERROR-selector0-2016-10-26 00:00:00,002-Console-selector0—2016-10-26 00:00:00 UTC ERROR com.aerospike.client.AerospikeException: Error Code 14: Hot key

Symptoms

A hot-key can have the following side effects on the node that owns it:

  • A higher number of client connections on a single node. If that node is removed, the high connection count moves to another node.
  • Higher number of transactions (ops/second) seen on a single node.
  • Uneven device utilization (check iostat) or network utilization (check sar or other monitoring tools).
  • Higher latencies.

Explanation

A key is considered a hot-key when reads or updates are issued against it at a high rate. When a server node receives too many concurrent (or quasi-concurrent) operation requests for the same key, it rejects the request, returns KEY_BUSY, and increments the fail_key_busy statistic. This condition only applies to transactions that make use of the rw-hash.

Note: the rw-hash is the structure used to park transactions that need to consult another node before returning to the client. While such a transaction is in progress, other transactions on the same key (see the bullet points below) can queue up behind it and are restarted once the pending transaction completes. The following transactions go through the rw-hash:

  • write transactions.
  • read transactions if duplicate resolution is required (only when migrations are going on).
  • for strong consistency enabled namespaces, if a write transaction is in progress, read transactions will also be parked in the rw-hash (also for re-replication if necessary).

Server statistics do not directly track hot-keys for read transactions, except for reads that require duplicate resolution (during ongoing migrations). Identifying frequently read keys on a stable cluster therefore requires additional logging on the client side. Alternatively, as of version 3.16, a special log context (rw-client) logs every client transaction, which makes potential hot-keys easier to identify.
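The client-side logging mentioned above can be as simple as counting accesses per key in the application. The following is a minimal sketch and not part of any Aerospike client API: the class name KeyAccessTracker and its methods are hypothetical, and the application is assumed to call record() next to each read it issues.

```python
from collections import Counter
import threading


class KeyAccessTracker:
    """Counts accesses per (namespace, set, key) so frequent keys can be reported.

    Hypothetical helper for illustration; not part of any client library.
    """

    def __init__(self):
        self._counts = Counter()
        self._lock = threading.Lock()

    def record(self, namespace, set_name, user_key):
        # Call this next to every read (or write) the application issues.
        with self._lock:
            self._counts[(namespace, set_name, user_key)] += 1

    def top(self, n=5):
        # Return the n most frequently accessed keys with their counts.
        with self._lock:
            return self._counts.most_common(n)


tracker = KeyAccessTracker()
for user_key in ["u1", "u2", "u1", "u3", "u1", "u2"]:
    tracker.record("test", "testset", user_key)

hottest = tracker.top(2)
print(hottest)
```

In a long-running service the counter would need to be reset or sampled periodically so that it reflects recent traffic rather than the whole process lifetime.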

Verification

One potential symptom of hot-keys is a high and increasing value of the fail_key_busy statistic, which shows the number of transactions failing on hot-keys.

asadm > watch 2 show statistics like fail_key_busy

For server versions 3.16.0.1 and above, the following log line also appears on the node that has the hot-key, and the statistic is incremented, each time a key reaches the limit of pending write requests (20 by default) or of pending reads that require duplicate resolution.

Apr 26 2018 20:50:06 GMT: INFO (info): (ticker.c:795) {namespace} special-errors: key-busy 200 record-too-big 0
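To watch this counter from collected logs rather than by eye, the ticker line can be parsed with a short script. The sketch below assumes the line format matches the sample shown above; the function name key_busy_from_line is made up for illustration.

```python
import re

# Matches the namespace and the key-busy counter in a "special-errors" ticker line.
SPECIAL_ERRORS = re.compile(
    r"\{(?P<ns>[^}]+)\} special-errors: key-busy (?P<busy>\d+)"
)


def key_busy_from_line(line):
    """Return (namespace, key_busy_count), or None if the line does not match."""
    match = SPECIAL_ERRORS.search(line)
    if match is None:
        return None
    return match.group("ns"), int(match.group("busy"))


sample = ("Apr 26 2018 20:50:06 GMT: INFO (info): (ticker.c:795) "
          "{namespace} special-errors: key-busy 200 record-too-big 0")
parsed = key_busy_from_line(sample)
print(parsed)  # ('namespace', 200)
```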

Identifying the Hot-Key

To investigate the source of the hot-key further, you can enable detail-level logging for the rw-client context.
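Once detail-level logs are available, the most frequent digests can be tallied with a few lines of scripting. This sketch makes one assumption about the log format: that each relevant line contains the record digest as 40 hexadecimal digits, optionally prefixed with 0x. The sample lines are invented for illustration and are not real server output, so adjust the pattern to the lines your server actually writes.

```python
import re
from collections import Counter

# Assumption: a digest appears as 40 hex digits, optionally prefixed with "0x".
DIGEST = re.compile(r"\b(?:0x)?([0-9a-fA-F]{40})\b")


def count_digests(lines):
    """Count how often each digest appears across the given log lines."""
    counts = Counter()
    for line in lines:
        match = DIGEST.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


# Invented sample lines; only the digest token matters to the parser.
hot = "f59124986e96ad175b374c9487945bbcad537b74"
cold = "ab" * 20
sample_lines = [
    f"{{test}} digest 0x{hot} write",
    f"{{test}} digest 0x{cold} read",
    f"{{test}} digest 0x{hot} write",
    "a line without any digest",
]
digest_counts = count_digests(sample_lines)
print(digest_counts.most_common(1))
```

The digest at the top of the list is the candidate hot-key to map back to a node, set, or primary key using the steps that follow.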

Note: Usually, only write transactions are susceptible to fail with key busy, but read transactions could also encounter such failure if read.consistency_level is set to all (which would trigger duplicate resolution during migrations) or for strong consistency enabled namespaces. Otherwise, in AP namespaces, read transactions (keeping default consistency level) would still proceed independently of write transactions having reached the pending limit.

  1. Once identified, the AQL explain command should help with the mapping of digest to a Node ID or to identify the namespace/set. But, the symptoms mentioned above might already identify the node in question.
aql> explain select * from test.testset where PK=1
+-----------+-------------------------------------------------------------+-----------+-----------+--------+---------+-----------+----------------------------+-------------------+-------------------------+---------+
| SET       |                    DIGEST                                   | NAMESPACE | PARTITION | STATUS | UDF     | KEY_TYPE  | POLICY_REPLICA             | NODE              | POLICY_KEY              | TIMEOUT |
+-----------+-------------------------------------------------------------+-----------+-----------+--------+---------+-----------+----------------------------+-------------------+-------------------------+---------+
| "testset" | F5 91 24 98 6E 96 AD 17 5B 37 4C 94 87 94 5B BC AD 53 7B 74 | "test"    | 501       | 0      | "FALSE" | "INTEGER" | "AS_POLICY_REPLICA_MASTER" | "BB9A2DD34270008" | "AS_POLICY_KEY_DEFAULT" | 1000    |
+-----------+-------------------------------------------------------------+-----------+-----------+--------+---------+-----------+----------------------------+-------------------+-------------------------+---------+
  2. To identify the Primary Key from the digest, refer to question 7 of the FAQ on Keys and Digest.

  3. For another way to identify the Set that contains the hot-key from the digest, refer to the article on Identifying Set Name from a Digest.

  4. A more complex approach (more prevalent before detail-level logging became available in server versions 3.16 and above) is to capture a TCP dump and analyze it with Linux sorting and formatting tools to identify the key. Refer to the TCP Dump analysis article for details.

Solution

There is no direct solution if a server is expected to handle hot-keys. It is important to catch these errors on the client and to consider revisiting the application design, potentially by splitting the hot-key's data across multiple records.
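One common way to split a hot-key is to shard it: writes go to one of several sub-records and reads recombine them. The sketch below illustrates the idea for a counter. The in-memory dict stands in for the database, and NUM_SHARDS, shard_key, increment, and read_total are hypothetical names, not Aerospike APIs.

```python
import random

NUM_SHARDS = 8  # hypothetical value; size it to the write rate on the hot key


def shard_key(base_key, shard):
    # Each shard is stored as a separate record under its own key.
    return f"{base_key}:{shard}"


def increment(store, base_key, amount=1, rng=random):
    # Write to one randomly chosen shard instead of the single hot record.
    key = shard_key(base_key, rng.randrange(NUM_SHARDS))
    store[key] = store.get(key, 0) + amount


def read_total(store, base_key):
    # Reads recombine the shards, at the cost of NUM_SHARDS lookups per read.
    return sum(store.get(shard_key(base_key, s), 0) for s in range(NUM_SHARDS))


store = {}  # stand-in for the database, for illustration only
for _ in range(1000):
    increment(store, "page_views")

total = read_total(store, "page_views")
print(total, len(store))
```

The trade-off is that writes no longer contend on one record, while reads become more expensive and are no longer a single-record operation.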

Note: for hotkeys on read transactions, turning on the read-page-cache may significantly help (available in versions 4.3.1 and above).

Temporary mitigation

By default, Aerospike allows a maximum of 20 pending transactions to queue up on the same key. To reduce the number of errors as a workaround, you can increase transaction-pending-limit on one or all of the nodes.

Note: increasing transaction-pending-limit may degrade performance when hot-keys are present. The following command raises the limit from the default of 20 to 25.

Admin> asinfo -v "set-config:context=namespace;id=namespaceName;transaction-pending-limit=25"
ok

In versions prior to Aerospike 4.3.1, transaction-pending-limit was part of the service context rather than the namespace context, so in those releases the command takes the following form (here setting the limit to 3):

Admin> asinfo -v "set-config:context=service;transaction-pending-limit=3"
ok

Depending on the impact (or simply to confirm whether write hot-keys exist on the system), one could instead mitigate the issue by lowering transaction-pending-limit to 1 or 2. Setting it to 0 makes the queue unlimited, so 1 is the lowest value to attempt. Note that this generates more errors on the client side but can lower the intensity of the impact on the server side.
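Since key busy errors surface on the client, the application needs a policy for them. One option is to retry with a jittered exponential backoff so that retries do not pile onto the same key at the same moment. The sketch below is not tied to any client library: ClientError is a hypothetical stand-in for the client's exception type, and flaky_write simulates an operation that fails twice with result code 14.

```python
import random
import time

KEY_BUSY = 14  # result code for the hot key error


class ClientError(Exception):
    """Hypothetical stand-in for a client library exception with a result code."""

    def __init__(self, code, message=""):
        super().__init__(message)
        self.code = code


def call_with_backoff(operation, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Run operation(), retrying only on KEY_BUSY with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ClientError as err:
            # Give up on other errors, or once the attempts are used up.
            if err.code != KEY_BUSY or attempt == max_attempts - 1:
                raise
            # Wait a random time of up to base_delay * 2^attempt seconds.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))


attempts = []


def flaky_write():
    # Simulated operation: fails twice with KEY_BUSY, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise ClientError(KEY_BUSY, "Hot key")
    return "ok"


result = call_with_backoff(flaky_write, sleep=lambda seconds: None)
print(result, len(attempts))  # ok 3
```

Retries add load to a key that is already busy, so the attempt limit should stay small, and writes that are not idempotent need care before being retried.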

Potential workaround for distributing read load

Read replica policy: the read.replica policy applies to read operations and specifies which replica the client accesses during the read operation:

  • master (default) — Read the master replica.
  • any — Read an arbitrary replica (effectively a random replica).

NOTE: This approach lowers the data consistency guarantees for reads: a later read may return an earlier value for the record. For a detailed explanation, see Per-Transaction Consistency Guarantees: https://www.aerospike.com/docs/architecture/consistency.html

Use fire and forget (version 3.16 and above)

Finally, if the use case allows it, setting write-commit-level-override to master makes the write on the replica side act as a true fire and forget (only for version 3.16 and up), which limits the backlog of transactions in the rw-hash.

Keywords

HOTKEY FAIL_KEY_BUSY RW_HASH

Timestamp

June 2020