Client connection count high on a few nodes in a cluster
The client connection count (tracked under the client_connections metric) on one or a few nodes is higher than on the other nodes in the cluster.
1. Slow node
If a node is not performing as expected, for example because of a slow disk, network, or CPU, it can show a higher number of client connections than other nodes. The increased latency on that node causes requests to pile up there.
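One way to reason about this pile-up is Little's law, which says the average number of requests in flight equals the arrival rate multiplied by the time each request spends in the system. The sketch below is only an illustration under that assumption: the function name and the throughput and latency figures are hypothetical, not measurements from a real cluster.

```python
def expected_in_flight(requests_per_second, latency_seconds):
    """Little's law: average in-flight requests = arrival rate x latency."""
    return requests_per_second * latency_seconds


# Hypothetical figures: the same load hits a healthy node and a slow node.
healthy_node = expected_in_flight(10_000, 0.001)  # 1 ms average latency
slow_node = expected_in_flight(10_000, 0.020)     # 20 ms average latency

print(f"healthy node: ~{healthy_node:.0f} requests in flight")
print(f"slow node:    ~{slow_node:.0f} requests in flight")
```

With the throughput unchanged, a twentyfold increase in latency means roughly twenty times as many requests are outstanding at once, and each outstanding request occupies a connection.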
How does the client policy (timeout and retries) affect this situation?
If the client times out or encounters socket errors and retries are configured, the client closes the socket and may open a new connection (to the same node or another node, depending on the transaction type and policy details) shortly afterwards for the subsequent attempts of the transaction. On the server, the connections may remain open for proto-fd-idle-ms (default 60 seconds). This pushes an already high number of client connections even higher.
Depending on the transaction throughput and the client-side connection pool size, a temporary slowdown of a node can cause the increase in active client connections to persist. If the connections are never idle for proto-fd-idle-ms, they are simply reused, and a higher count is not an issue on its own. It is still necessary to monitor the proto-fd-max threshold, because reaching it prevents new connections from being established.
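The two checks described above (is one node out of line with its peers, and is it approaching proto-fd-max) can be sketched as a small helper. This is a minimal illustration, not an official tool: the function name, the thresholds, and the sample numbers are all assumptions, and the per-node client_connections values would have to be collected separately.

```python
from statistics import median


def connection_report(client_connections, proto_fd_max,
                      skew_factor=2.0, limit_pct=80.0):
    """Flag nodes whose connection count is skewed or close to proto-fd-max.

    client_connections: dict of node name -> client_connections value.
    skew_factor and limit_pct are arbitrary illustrative thresholds.
    """
    typical = median(client_connections.values())
    report = {}
    for node, count in client_connections.items():
        report[node] = {
            # Skewed: well above the median count across the cluster.
            "skewed": typical > 0 and count > skew_factor * typical,
            # Near limit: using most of the configured proto-fd-max.
            "near_limit": 100.0 * count / proto_fd_max >= limit_pct,
        }
    return report


# Hypothetical sample values for a three-node cluster.
sample = {"node-a": 900, "node-b": 950, "node-c": 13_000}
report = connection_report(sample, proto_fd_max=15_000)
for node, flags in report.items():
    print(node, flags)
```

Using the median rather than the mean keeps a single outlier node from raising the baseline it is compared against.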
2. Hotkeys
A specific key being accessed very frequently (a hotkey) can cause a higher client connection count on a node. The Aerospike data distribution scheme always assigns a record to the same partition (based on a hash of the primary key). That partition is owned by one node in the cluster as the master copy and by other nodes for its replica(s). To identify a hotkey effect, compare the throughput of the node with the high connection count against the other nodes in the cluster.
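The effect can be illustrated with a toy model. The sketch below deliberately uses SHA-256 as a stand-in hash and a round-robin partition-to-node mapping; neither is Aerospike's actual digest function or partition ownership algorithm, and all names are hypothetical. It only shows that a deterministic key-to-partition mapping sends every request for a hot key to the same node.

```python
import hashlib
from collections import Counter

N_PARTITIONS = 4096  # number of partitions in an Aerospike namespace


def toy_partition_id(key):
    """Stand-in hash (SHA-256), NOT Aerospike's real digest function."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS


def toy_owner(partition_id, nodes):
    """Hypothetical round-robin ownership, only to make the example concrete."""
    return nodes[partition_id % len(nodes)]


def requests_per_node(keys, nodes):
    """Count how many requests each node would serve for the given keys."""
    return Counter(toy_owner(toy_partition_id(k), nodes) for k in keys)


nodes = ["node-a", "node-b", "node-c"]
uniform_load = [f"user-{i}" for i in range(3000)]
hot_load = uniform_load + ["user-42"] * 3000  # one key requested 3000 more times

print("uniform:", dict(requests_per_node(uniform_load, nodes)))
print("hotkey: ", dict(requests_per_node(hot_load, nodes)))
```

In the second run, all of the extra requests land on whichever node owns the partition of the hot key, while the other nodes see no change.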
- For further details on debugging hotkeys, refer to the following links:
3. Info requests to a single node
An increased number of info requests hitting a particular node in the cluster can make the connection count on that node higher than on other nodes. One can check and compare the info_queue metric (info-q in the Aerospike logs) across the cluster nodes. Any application, monitoring tool, or script that sends info calls more frequently to one particular node can cause this situation.
To debug this further, one can use netstat or a similar command to identify the source IP addresses and trace them back to the application or monitoring tool.
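As a rough sketch of that approach, the helper below counts established connections per remote IP from text in the usual Linux `netstat -tn` column layout. The function name and the sample output are hypothetical, and port 3000 is assumed to be the service port; adjust it if the node is configured differently.

```python
from collections import Counter


def connections_by_source(netstat_output, service_port=3000):
    """Count ESTABLISHED connections to service_port, grouped by remote IP."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        if local.rsplit(":", 1)[-1] != str(service_port):
            continue
        counts[remote.rsplit(":", 1)[0]] += 1
    return counts


# Hypothetical sample in the shape printed by `netstat -tn` on Linux.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.5:3000           10.0.1.17:51234         ESTABLISHED
tcp        0      0 10.0.0.5:3000           10.0.1.17:51236         ESTABLISHED
tcp        0      0 10.0.0.5:3000           10.0.1.17:51240         ESTABLISHED
tcp        0      0 10.0.0.5:3000           10.0.1.20:40022         ESTABLISHED
tcp        0      0 10.0.0.5:3000           10.0.1.21:40100         TIME_WAIT
tcp        0      0 10.0.0.5:3001           10.0.2.9:38000          ESTABLISHED
"""

for ip, n in connections_by_source(SAMPLE).most_common():
    print(f"{ip}: {n} connection(s)")
```

The source IPs with the most connections are the first candidates to check for an application or monitoring script that targets a single node.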
4. Uneven data distribution
By default, Aerospike optimizes the partition distribution across nodes in a cluster to minimize migration traffic (moving of partitions between the nodes when a node is added or removed from the cluster). The
prefer-uniform-balance configuration forces a uniform distribution of partitions at the expense of slightly more partition movement during migrations. When records are accessed uniformly, a cluster with an unbalanced partition distribution generally sees more traffic on the nodes holding more partitions (and therefore more records), which causes a correlated imbalance in client connection count.
Other, less common situations can also lead to imbalanced data across nodes in a cluster, for example a cluster that has been restored from a partial backup or that is currently having data restored to it. The backup process is a partition-by-partition scan that stores records on file in the order they are read. The restore process therefore restores the records in the same order, partition by partition, causing a temporary imbalance (or a permanent one if the backup was not full or if the restore was interrupted).
June 17th 2019