Client connection count high on a few cluster nodes

Description

The connection count (tracked under client_connections) on a single node or a few nodes is seen to be higher than on other nodes in the cluster.

Common reasons:

1. Slow node

If a node is not performing as expected, for example due to a slow disk, slow network, or an overloaded CPU, it can appear to have a higher number of client connections than other nodes. The slowness or increased latency on that node may cause more requests to pile up there.

How does the client policy (timeout and retries) affect this situation?

If the client times out or encounters socket errors and retries are configured, it will close the socket and may open a new connection shortly afterwards (to the same node or another node, depending on the transaction type and policy details) for the subsequent attempts of the transaction. On the server, the abandoned connections may remain open for up to proto-fd-idle-ms (default 60 seconds). This can push an already high client connection count even higher.
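The interaction above can be put into a back-of-the-envelope formula. The sketch below is not an Aerospike API; it is a simple estimate, assuming each timed-out attempt abandons one server-side connection that lingers until the proto-fd-idle-ms reaper closes it:

```python
# Back-of-the-envelope sketch: estimate how many abandoned server-side
# connections can accumulate when clients time out, close their sockets,
# and retry on fresh connections. The server only reaps a connection
# after it has been idle for proto-fd-idle-ms.

def stale_connection_estimate(timeouts_per_sec, retries, proto_fd_idle_ms=60_000):
    """Worst-case count of abandoned-but-not-yet-reaped server connections.

    Each timed-out transaction can abandon one connection per attempt
    (the original try plus each retry), and every abandoned connection
    may linger for up to proto-fd-idle-ms before the server reaps it.
    """
    attempts = 1 + retries
    lingering_window_sec = proto_fd_idle_ms / 1000
    return int(timeouts_per_sec * attempts * lingering_window_sec)

# Example: 50 timed-out transactions/sec with 2 retries configured and
# the default 60-second idle reap window.
print(stale_connection_estimate(50, 2))  # → 9000
```

The point of the estimate is that even a modest timeout rate, multiplied by retries and a 60-second reap window, can add thousands of connections on top of the normal pool.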

Recovery

Depending on the transaction throughput and the client-side connection pool size, a temporary slowdown of a node could cause the increase in active client connections to persist. If the connections are never idle for proto-fd-idle-ms, they will simply be reused, and a higher count is not an issue on its own. Of course, one should also monitor the proto-fd-max threshold, which, when reached, prevents new connections from being established.
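To monitor the headroom against proto-fd-max, one can compare client_connections (from the node's statistics) against the configured proto-fd-max (from the service config). A minimal sketch, assuming the semicolon-separated key=value format that `asinfo -v 'statistics'` and `asinfo -v 'get-config:context=service'` return (the sample strings below are hypothetical and trimmed to the relevant fields):

```python
# Sketch: compute how close client_connections is to proto-fd-max,
# given the semicolon-separated key=value info strings from asinfo.

def parse_info(info_text):
    """Parse 'k1=v1;k2=v2;...' into a dict of strings."""
    return dict(pair.split("=", 1) for pair in info_text.split(";") if pair)

def fd_usage(stats_text, config_text):
    """Return the fraction of proto-fd-max currently used by client connections."""
    stats = parse_info(stats_text)
    config = parse_info(config_text)
    return int(stats["client_connections"]) / int(config["proto-fd-max"])

# Hypothetical info outputs, trimmed to the relevant fields:
stats = "cluster_size=3;client_connections=12000"
config = "proto-fd-max=15000;proto-fd-idle-ms=60000"
print(f"{fd_usage(stats, config):.0%} of proto-fd-max in use")  # → 80%
```

Alerting well before 100% leaves time to react before new client connections start getting refused.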

2. Hotkeys

A specific key being accessed very frequently can cause a higher client connection count on a node. Indeed, the Aerospike data distribution scheme always assigns a record to the same partition (based on the primary key hash), which is owned by one of the nodes in the cluster as the master copy, with other nodes holding its replica(s). To identify a hotkey effect, compare the throughput of the node with the high connection count against the other nodes in the cluster.
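The mapping can be illustrated conceptually. Aerospike hashes the set name and key into a digest and derives one of 4096 partitions from it (the real digest is RIPEMD-160; SHA-256 is used below only as a stand-in so the sketch is self-contained):

```python
# Conceptual sketch of why a hotkey pins load to one node: every record
# digest maps deterministically to one of 4096 partitions, and each
# partition has a single master node. (Stand-in hash, not the real
# RIPEMD-160 digest scheme.)
import hashlib

N_PARTITIONS = 4096  # fixed partition count in an Aerospike cluster

def partition_id(set_name, key):
    digest = hashlib.sha256(f"{set_name}:{key}".encode()).digest()
    # The real server derives the partition from bits of the digest;
    # a modulo over the first bytes conveys the same idea.
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS

# The same key always lands on the same partition, hence the same master
# node, no matter which client or connection issues the request.
print(partition_id("users", "hot-key") == partition_id("users", "hot-key"))  # → True
```

Because the mapping is deterministic, no amount of client-side load balancing spreads a single hot record: all reads and writes for it converge on the partition's master (and its replicas for writes).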

3. Info requests to a single node

An increased number of info requests hitting a particular node in the cluster can cause the connection count on that node to be higher than on other nodes. One can check and compare the info_queue metric (info-q in the Aerospike logs) across the cluster nodes. Any application, monitoring tool, or script that issues info calls more frequently to a particular node can cause such a situation. To debug this further, one can use netstat or a similar command to identify and track the source IP of the monitoring tool.
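To spot the offending source, one can tally established connections to the Aerospike service port (3000 by default) by peer IP. A small sketch that parses `netstat -tn`-style lines (the sample lines below are hypothetical):

```python
# Sketch: count ESTABLISHED connections to the Aerospike service port
# (default 3000) per source IP, e.g. from `netstat -tn` output, to spot
# a monitoring host holding an outsized number of connections.
from collections import Counter

def connections_by_source(netstat_lines, service_port=3000):
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        # Expected columns: proto recv-q send-q local-addr foreign-addr state
        if len(fields) >= 6 and fields[5] == "ESTABLISHED":
            local, foreign = fields[3], fields[4]
            if local.endswith(f":{service_port}"):
                counts[foreign.rsplit(":", 1)[0]] += 1
    return counts

# Hypothetical `netstat -tn` lines:
sample = [
    "tcp 0 0 10.0.0.1:3000 10.0.0.9:54001 ESTABLISHED",
    "tcp 0 0 10.0.0.1:3000 10.0.0.9:54002 ESTABLISHED",
    "tcp 0 0 10.0.0.1:3000 10.0.0.7:41000 ESTABLISHED",
]
print(connections_by_source(sample))  # 10.0.0.9 holds 2, 10.0.0.7 holds 1
```

A source IP with a disproportionate count, especially one running a monitoring script, is the usual culprit.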

4. Uneven data distribution

By default, Aerospike optimizes the partition distribution across the nodes of a cluster to minimize migration traffic (the moving of partitions between nodes when a node is added or removed). The prefer-uniform-balance configuration forces a uniform distribution of partitions at the expense of a bit more partition movement during migrations. A cluster with an unbalanced partition distribution, accessed uniformly across its records, will in general see more traffic against the nodes holding more partitions (and therefore more records), causing a correlated imbalance in client connection count.
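For reference, prefer-uniform-balance is set per namespace. A minimal configuration fragment (namespace name and other parameters are placeholders, not a complete stanza):

```
namespace test {
    replication-factor 2
    prefer-uniform-balance true
    # ... remaining namespace configuration ...
}
```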

Other, less common situations can lead to imbalanced data across the nodes of a cluster: for example, a cluster restored from a partial backup, or one that is currently having data restored to it. Indeed, the backup process is a partition-by-partition scan, storing records to file in the order they are read. The restore process therefore restores records in the same order, partition by partition, causing a temporary imbalance (or a permanent one if the backup was partial or the restore was interrupted).

Keywords

CLIENT CONNECTION HIGH HOTKEYS IMBALANCE UNBALANCED INFO

Timestamp

June 17th 2019

Also check the client version if using the Java client.

Java client versions < 4.3.1 use a FIFO queue for the connection pool, causing the connection churn to be almost zero.

With client versions >= 4.3.1, the pool was changed to a LIFO stack, with connections churned from the back of the stack, making it more connection-reuse friendly.
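The FIFO versus LIFO difference can be illustrated with a toy pool (this is not the Java client's actual code, just a sketch of the access pattern under low load):

```python
# Illustration of why a LIFO pool is more reuse-friendly: with FIFO,
# every checkout rotates the whole pool, so all connections stay
# "recently used" and none can go idle; with LIFO, the same few hot
# connections are reused from the top while the bottom of the stack
# idles out and can be closed.
from collections import deque

def distinct_connections_used(pool_size, checkouts, lifo):
    pool = deque(range(pool_size))
    used = set()
    for _ in range(checkouts):
        conn = pool.pop() if lifo else pool.popleft()  # top of stack vs front of queue
        used.add(conn)
        pool.append(conn)  # return the connection to the pool
    return len(used)

# 100 sequential transactions against a 10-connection pool:
print(distinct_connections_used(10, 100, lifo=False))  # FIFO → 10 (all rotate)
print(distinct_connections_used(10, 100, lifo=True))   # LIFO → 1 (one hot connection)
```

Under the FIFO pattern every connection is touched in rotation, so none ever reaches the idle timeout; the LIFO pattern keeps a small hot set busy and lets the rest go idle and be trimmed.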