Client connection count high on a few cluster nodes

Description

The connection count (tracked under client_connections) on a single node or a few nodes is seen to be higher than on other nodes in the cluster.

Common reasons:

1. Slow node

If a node is not performing as expected, for example due to a slow disk, slow network, or an overloaded CPU, it can appear to have a higher number of client connections than other nodes. The slowness or increased latency on that node may cause more requests to pile up there.

How does the client policy (timeout and retries) affect this situation?

If the client times out or encounters socket errors and retries are configured, it will close the socket and may open a new connection shortly afterwards (to the same node or another node, depending on the transaction type and policy details) for the subsequent attempts of the transaction. On the server, the abandoned connections may remain open for up to proto-fd-idle-ms (default 60 seconds). This can push an already high client connection count even higher.
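The interaction above can be put into a back-of-the-envelope formula. The sketch below is not an Aerospike API; it is a simple estimate, assuming each timed-out attempt abandons one server-side connection that lingers until the proto-fd-idle-ms reaper closes it:

```python
# Back-of-the-envelope sketch: estimate how many abandoned server-side
# connections can accumulate when clients time out, close their sockets,
# and retry on fresh connections. The server only reaps a connection
# after it has been idle for proto-fd-idle-ms.

def stale_connection_estimate(timeouts_per_sec, retries, proto_fd_idle_ms=60_000):
    """Worst-case count of abandoned-but-not-yet-reaped server connections.

    Each timed-out transaction can abandon one connection per attempt
    (the original try plus each retry), and every abandoned connection
    may linger for up to proto-fd-idle-ms before the server reaps it.
    """
    attempts = 1 + retries
    lingering_window_sec = proto_fd_idle_ms / 1000
    return int(timeouts_per_sec * attempts * lingering_window_sec)

# Example: 50 timed-out transactions/sec with 2 retries configured and
# the default 60-second idle reap window.
print(stale_connection_estimate(50, 2))  # → 9000
```

The point of the estimate is that even a modest timeout rate, multiplied by retries and a 60-second reap window, can add thousands of connections on top of the normal pool.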

Recovery

Depending on the transaction throughput and the client-side connection pool size, a temporary slowdown of a node could cause the increase in active client connections to persist. If the connections are never idle for proto-fd-idle-ms, they will simply be reused, and a higher count is not an issue on its own. Of course, one should also monitor the proto-fd-max threshold, which, when reached, prevents new connections from being established.
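To monitor the headroom against proto-fd-max, one can compare client_connections (from the node's statistics) against the configured proto-fd-max (from the service config). A minimal sketch, assuming the semicolon-separated key=value format that `asinfo -v 'statistics'` and `asinfo -v 'get-config:context=service'` return (the sample strings below are hypothetical and trimmed to the relevant fields):

```python
# Sketch: compute how close client_connections is to proto-fd-max,
# given the semicolon-separated key=value info strings from asinfo.

def parse_info(info_text):
    """Parse 'k1=v1;k2=v2;...' into a dict of strings."""
    return dict(pair.split("=", 1) for pair in info_text.split(";") if pair)

def fd_usage(stats_text, config_text):
    """Return the fraction of proto-fd-max currently used by client connections."""
    stats = parse_info(stats_text)
    config = parse_info(config_text)
    return int(stats["client_connections"]) / int(config["proto-fd-max"])

# Hypothetical info outputs, trimmed to the relevant fields:
stats = "cluster_size=3;client_connections=12000"
config = "proto-fd-max=15000;proto-fd-idle-ms=60000"
print(f"{fd_usage(stats, config):.0%} of proto-fd-max in use")  # → 80%
```

Alerting well before 100% leaves time to react before new client connections start getting refused.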

2. Hotkeys

A specific key being accessed very frequently can cause a higher client connection count on a node. Indeed, the Aerospike data distribution scheme always assigns a record to the same partition (based on the primary key hash), which is owned by one of the nodes in the cluster as the master copy, with other nodes holding its replica(s). To identify a hotkey effect, compare the throughput of the node with the high connection count against the other nodes in the cluster.
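The mapping can be illustrated conceptually. Aerospike hashes the set name and key into a digest and derives one of 4096 partitions from it (the real digest is RIPEMD-160; SHA-256 is used below only as a stand-in so the sketch is self-contained):

```python
# Conceptual sketch of why a hotkey pins load to one node: every record
# digest maps deterministically to one of 4096 partitions, and each
# partition has a single master node. (Stand-in hash, not the real
# RIPEMD-160 digest scheme.)
import hashlib

N_PARTITIONS = 4096  # fixed partition count in an Aerospike cluster

def partition_id(set_name, key):
    digest = hashlib.sha256(f"{set_name}:{key}".encode()).digest()
    # The real server derives the partition from bits of the digest;
    # a modulo over the first bytes conveys the same idea.
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS

# The same key always lands on the same partition, hence the same master
# node, no matter which client or connection issues the request.
print(partition_id("users", "hot-key") == partition_id("users", "hot-key"))  # → True
```

Because the mapping is deterministic, no amount of client-side load balancing spreads a single hot record: all reads and writes for it converge on the partition's master (and its replicas for writes).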

3. Info requests to a single node

An increased number of info requests hitting a particular node in the cluster can cause the connection count on that node to be higher than on other nodes. One can check and compare the info_queue metric (info-q in the Aerospike logs) across the cluster nodes. Any application, monitoring tool, or script that issues info calls more frequently to a particular node can cause such a situation. To debug this further, one can use netstat or a similar command to identify and track the source IP of the monitoring tool.
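To spot the offending source, one can tally established connections to the Aerospike service port (3000 by default) by peer IP. A small sketch that parses `netstat -tn`-style lines (the sample lines below are hypothetical):

```python
# Sketch: count ESTABLISHED connections to the Aerospike service port
# (default 3000) per source IP, e.g. from `netstat -tn` output, to spot
# a monitoring host holding an outsized number of connections.
from collections import Counter

def connections_by_source(netstat_lines, service_port=3000):
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        # Expected columns: proto recv-q send-q local-addr foreign-addr state
        if len(fields) >= 6 and fields[5] == "ESTABLISHED":
            local, foreign = fields[3], fields[4]
            if local.endswith(f":{service_port}"):
                counts[foreign.rsplit(":", 1)[0]] += 1
    return counts

# Hypothetical `netstat -tn` lines:
sample = [
    "tcp 0 0 10.0.0.1:3000 10.0.0.9:54001 ESTABLISHED",
    "tcp 0 0 10.0.0.1:3000 10.0.0.9:54002 ESTABLISHED",
    "tcp 0 0 10.0.0.1:3000 10.0.0.7:41000 ESTABLISHED",
]
print(connections_by_source(sample))  # 10.0.0.9 holds 2, 10.0.0.7 holds 1
```

A source IP with a disproportionate count, especially one running a monitoring script, is the usual culprit.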

4. Uneven data distribution

By default, Aerospike optimizes the partition distribution across the nodes of a cluster to minimize migration traffic (the moving of partitions between nodes when a node is added or removed). The prefer-uniform-balance configuration forces a uniform distribution of partitions at the expense of a bit more partition movement during migrations. A cluster with an unbalanced partition distribution, accessed uniformly across its records, will in general see more traffic against the nodes holding more partitions (and therefore more records), causing a correlated imbalance in client connection count.
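For reference, prefer-uniform-balance is set per namespace. A minimal configuration fragment (namespace name and other parameters are placeholders, not a complete stanza):

```
namespace test {
    replication-factor 2
    prefer-uniform-balance true
    # ... remaining namespace configuration ...
}
```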

Other, less common situations can lead to imbalanced data across the nodes of a cluster: for example, a cluster restored from a partial backup, or one that is currently having data restored to it. Indeed, the backup process is a partition-by-partition scan, storing records to file in the order they are read. The restore process therefore restores records in the same order, partition by partition, causing a temporary imbalance (or a permanent one if the backup was partial or the restore was interrupted).

Keywords

CLIENT CONNECTION HIGH HOTKEYS IMBALANCE UNBALANCED INFO

Timestamp

June 17th 2019

Also check the client version if using the Java client.

Java client versions < 4.3.1 use a FIFO queue for the connection pool, causing the connection churn to be almost zero.

With client versions >= 4.3.1, the pool was changed to a LIFO stack, with connections churned from the back of the stack, making it more connection-reuse friendly.
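The FIFO versus LIFO difference can be illustrated with a toy pool (this is not the Java client's actual code, just a sketch of the access pattern under low load):

```python
# Illustration of why a LIFO pool is more reuse-friendly: with FIFO,
# every checkout rotates the whole pool, so all connections stay
# "recently used" and none can go idle; with LIFO, the same few hot
# connections are reused from the top while the bottom of the stack
# idles out and can be closed.
from collections import deque

def distinct_connections_used(pool_size, checkouts, lifo):
    pool = deque(range(pool_size))
    used = set()
    for _ in range(checkouts):
        conn = pool.pop() if lifo else pool.popleft()  # top of stack vs front of queue
        used.add(conn)
        pool.append(conn)  # return the connection to the pool
    return len(used)

# 100 sequential transactions against a 10-connection pool:
print(distinct_connections_used(10, 100, lifo=False))  # FIFO → 10 (all rotate)
print(distinct_connections_used(10, 100, lifo=True))   # LIFO → 1 (one hot connection)
```

Under the FIFO pattern every connection is touched in rotation, so none ever reaches the idle timeout; the LIFO pattern keeps a small hot set busy and lets the rest go idle and be trimmed.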