Hot Key error code 14



Problem Description

The client API returns error code 14, or the client logs KEY_BUSY result code messages, for example:

ERROR-selector0-2016-10-26 00:00:00,002-Console-selector0—2016-10-26 00:00:00 UTC ERROR com.aerospike.client.AerospikeException: Error Code 14: Hot key

Symptoms

A hot-key can have the following side effects on the node that owns it:

  • Higher number of client connections on a single node, or a high connection count that moves to other nodes when the node carrying it is removed.
  • Higher number of transactions (ops/second) seen on a single node.
  • Uneven device utilization (check iostat) or network utilization (check sar or other monitoring tools).
  • Higher latencies.

Explanation

A hot-key occurs when a high rate of reads or updates is performed on a single key. When a server node receives too many concurrent (or quasi-concurrent) requests for the same key, it rejects the request, returns KEY_BUSY, and increments the fail_key_busy statistic. This condition applies only to transactions that use the rw-hash, the structure used to park transactions that must consult another node before replying to the client. These are typically writes, or reads that require duplicate resolution (during migrations).

Hot-keys on read transactions are not directly monitored by server statistics unless the reads require duplicate resolution (during ongoing migrations). Identifying frequently read keys on a stable cluster therefore requires additional logging on the client side. Alternatively, as of version 3.16, a special log context (rw-client) logs every client transaction, making it easier to identify potential hot-keys.
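The client-side logging mentioned above can be as simple as counting reads per key. The following is a minimal sketch, not part of any Aerospike client library: the `KeyAccessTracker` class and its method names are hypothetical, and a real application would also need to bound memory use and reset the counts periodically.

```python
from collections import Counter


class KeyAccessTracker:
    """Counts reads per key so the most frequently read keys can be reported."""

    def __init__(self):
        self._counts = Counter()

    def record(self, namespace, set_name, user_key):
        # Call this next to every client read; the tuple identifies the record.
        self._counts[(namespace, set_name, user_key)] += 1

    def hottest(self, n=3):
        # Returns the n most frequently read keys together with their counts.
        return self._counts.most_common(n)


tracker = KeyAccessTracker()
for user_key in [1, 2, 1, 3, 1, 2]:
    tracker.record("test", "testset", user_key)
```

Calling `tracker.hottest(1)` here reports the key `("test", "testset", 1)` with three reads.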

Verification

One potential symptom of hot-keys is a high and increasing value of the fail_key_busy statistic, which counts the transactions failing on ‘hot-keys’.

asadm > watch 2 show statistics like fail_key_busy

For server versions 3.16.0.1 and above, the following log line also appears on the node that has the hot-key, alongside the statistic increment, each time a key reaches the (default) limit of 20 pending write requests (or reads doing duplicate resolution).

Apr 26 2018 20:50:06 GMT: INFO (info): (ticker.c:795) {namespace} special-errors: key-busy 200 record-too-big 0
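To follow this counter from the logs rather than from asadm, the line can be parsed with a regular expression. This is a sketch modelled only on the sample line above; the function name `parse_key_busy` is hypothetical, and the log format may differ in other server versions.

```python
import re

# Pattern modelled on the sample ticker line above; other versions may differ.
SPECIAL_ERRORS = re.compile(
    r"\{(?P<namespace>[^}]+)\} special-errors: key-busy (?P<key_busy>\d+)"
)


def parse_key_busy(line):
    """Return (namespace, key_busy_count), or None if the line does not match."""
    match = SPECIAL_ERRORS.search(line)
    if match is None:
        return None
    return match.group("namespace"), int(match.group("key_busy"))


sample = (
    "Apr 26 2018 20:50:06 GMT: INFO (info): (ticker.c:795) "
    "{namespace} special-errors: key-busy 200 record-too-big 0"
)
result = parse_key_busy(sample)
```

Comparing the count between successive ticker lines for the same namespace shows whether key-busy failures are still occurring.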

To investigate the source of the hot-key further, you can enable detail-level logging for the rw-client context. For more details on contexts and log levels, refer to this knowledge base article.

Note: Usually, only write transactions are susceptible to failing with key busy, but read transactions can also fail this way if read.consistency_level is set to all (which triggers duplicate resolution during migrations). Otherwise, in AP namespaces, read transactions (with the default consistency level) still proceed independently of write transactions that have reached the pending limit.

  1. The AQL explain command can help map a key to a node ID.
aql> explain select * from test.testset where PK=1
+-----------+-------------------------------------------------------------+-----------+-----------+--------+---------+-----------+----------------------------+-------------------+-------------------------+---------+
| SET       |                    DIGEST                                   | NAMESPACE | PARTITION | STATUS | UDF     | KEY_TYPE  | POLICY_REPLICA             | NODE              | POLICY_KEY              | TIMEOUT |
+-----------+-------------------------------------------------------------+-----------+-----------+--------+---------+-----------+----------------------------+-------------------+-------------------------+---------+
| "testset" | F5 91 24 98 6E 96 AD 17 5B 37 4C 94 87 94 5B BC AD 53 7B 74 | "test"    | 501       | 0      | "FALSE" | "INTEGER" | "AS_POLICY_REPLICA_MASTER" | "BB9A2DD34270008" | "AS_POLICY_KEY_DEFAULT" | 1000    |
+-----------+-------------------------------------------------------------+-----------+-----------+--------+---------+-----------+----------------------------+-------------------+-------------------------+---------+
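The PARTITION value in this output can be reproduced from the DIGEST column. The sketch below assumes the partition ID is the first two digest bytes read as a little-endian integer, modulo 4096 partitions, which is how the client libraries are understood to compute it; the function name `partition_id` is illustrative only.

```python
N_PARTITIONS = 4096  # assumed fixed number of partitions per namespace


def partition_id(digest_hex):
    """Derive the partition ID from a digest given as space-separated hex bytes."""
    digest = bytes.fromhex(digest_hex.replace(" ", ""))
    # First two digest bytes, little-endian, reduced to the partition range.
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS


pid = partition_id("F5 91 24 98 6E 96 AD 17 5B 37 4C 94 87 94 5B BC AD 53 7B 74")
```

For the digest shown above, the bytes F5 91 give 0x91F5 = 37365, and 37365 mod 4096 = 501, which agrees with the PARTITION column. The node that owns the partition still has to be taken from the NODE column or the cluster's partition map.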

Solution

There is no direct solution if a server is expected to handle hot-keys. It is important to catch these errors on the client and to revisit the application design, potentially splitting the hot key into multiple records.
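One way to split a key into multiple records is to spread writes across several sub-records and combine them on read. The sketch below illustrates the idea for a counter, using a plain dictionary as a stand-in for the database; the shard count, the `"<base>:<index>"` naming scheme, and all function names are assumptions made for illustration.

```python
import random

N_SHARDS = 8  # hypothetical; size it to the observed write rate


def shard_key(base_key, shard):
    # Hypothetical naming scheme: "<base>:<shard index>".
    return f"{base_key}:{shard}"


def increment(store, base_key, amount=1, rng=random):
    """Apply the write to one randomly chosen sub-record."""
    key = shard_key(base_key, rng.randrange(N_SHARDS))
    store[key] = store.get(key, 0) + amount


def read_total(store, base_key):
    """Read every sub-record and combine them into the logical value."""
    return sum(store.get(shard_key(base_key, s), 0) for s in range(N_SHARDS))


store = {}  # stand-in for the database
for _ in range(1000):
    increment(store, "page_views")
total = read_total(store, "page_views")
```

The trade-off is that each logical read now touches N_SHARDS records, so this suits keys that are written far more often than they are read.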

Temporary mitigation

By default, Aerospike allows a maximum of 20 pending transactions to be queued for the same key. As a workaround to reduce the number of errors, you can increase transaction-pending-limit on one or all nodes.

Note that increasing transaction-pending-limit may affect performance when hot-keys are present. The following command raises the limit from the default of 20 to 25.

asadm> asinfo -v "set-config:context=service;transaction-pending-limit=25;" 

Depending on the impact (or to confirm whether write hot-keys exist on the system), you can instead lower transaction-pending-limit to 1 or 2. Setting it to 0 makes the queue unlimited, so 1 is the lowest value to try. Note that a lower limit generates more errors on the client side but can reduce the impact on the server side.
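Because a lower limit surfaces more KEY_BUSY errors to the client, the application needs a deliberate policy for them. The sketch below retries only on result code 14, with jittered exponential backoff so that retries do not add to the contention. `ClientError` is a hypothetical stand-in for the client library's exception type, and the attempt count and delays are arbitrary assumptions.

```python
import random
import time

KEY_BUSY = 14  # result code from the error message in the Problem Description


class ClientError(Exception):
    """Hypothetical stand-in for the client library's exception type."""

    def __init__(self, result_code):
        super().__init__(f"Error Code {result_code}")
        self.result_code = result_code


def with_key_busy_retry(operation, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    """Run operation(), retrying only on KEY_BUSY with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ClientError as err:
            if err.result_code != KEY_BUSY or attempt == max_attempts - 1:
                raise
            # Back off so retries do not add to the pile-up on the hot key.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))


calls = {"n": 0}


def flaky_write():
    # Simulated write: fails twice with KEY_BUSY, then succeeds.
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ClientError(KEY_BUSY)
    return "ok"


outcome = with_key_busy_retry(flaky_write, sleep=lambda seconds: None)
```

Retrying is only appropriate when the operation is safe to repeat; for a persistently hot key, retries postpone the error rather than remove its cause.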

Potential workaround for distributing read load

Read Replica Policy

The read.replica policy applies to read operations and specifies which replica the client accesses during the read:

  • master (default) — Read the master replica.
  • any — Read an arbitrary replica (effectively a random replica).

NOTE: If you plan to use this solution for reads using lower data consistency guarantees, a later read may return an earlier value for the record. Please review the following document for a detailed explanation. See: Per-Transaction Consistency Guarantees – http://www.aerospike.com/docs/architecture/consistency.html

Use fire and forget (version 3.16 and above)

Finally, if the use case allows it, setting the write commit level to master only acts as a true fire and forget for the write on the replica side (version 3.16 and up only) and prevents transactions from backing up on the rw-hash.

Keywords

HOTKEY FAIL_KEY_BUSY RW_HASH

Timestamp

05/18/2018

