
Understanding Aerospike server proxies

Abstract

A read or write request from a client that ends up on a cluster node that does not hold the partition the transaction is addressing will be proxied to the correct cluster node. Proxies are expected while migrations are occurring, specifically when a partition's ownership changes but the client has not yet received the updated partition map. The focus of this article is troubleshooting proxies on a stable cluster with no migrations.

What gets proxied?

Depending on the namespace mode (strong consistency or not), the client's policy settings and the transaction type, a transaction can be sent to a random node in the cluster, forcing that node to proxy it to the node holding the partition needed to process the transaction.

Statistics/Logs to monitor

client_proxy_complete - Number of proxy transactions the node has initiated and successfully completed (a node initiates a proxy when it does not hold a copy of the partition required to process the transaction).
client_proxy_error - Number of proxy transactions initiated on the node that failed with an error.
client_proxy_timeout - Number of proxy transactions initiated on the node that timed out.
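These counters can be pulled from a live node with the asinfo tool. A minimal sketch, assuming asinfo is installed; the host and port in the commented invocation are placeholders for your own node:

```shell
# Extract the proxy counters from raw 'statistics' output. asinfo returns a
# single semicolon-separated line; tr splits it into one counter per line so
# grep can keep only the client_proxy_* entries.
proxy_counters() {
    tr ';' '\n' | grep '^client_proxy_'
}

# Against a live node (host/port are assumptions; adjust for your cluster):
#   asinfo -h 127.0.0.1 -p 3000 -v 'statistics' | proxy_counters
```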

From the server logs, you can monitor the ‘proxy’ values.

{ns_name} client: tsvc (0,0) proxy (0,0,0) read (126,0,1,3) write (2886,0,23) delete (197,0,1,19) udf (35,0,1) lang (26,7,0,3)

Refer to the log reference manual for further details.

What triggers proxies?

1) Connection issues on the service port (default 3000) between a client (XDR or standard clients) and server.

The client is unable to connect to a server node for a brief period (or longer) and might issue a partition map sync, assuming a node has left the cluster or become unreachable. In such a case, the client may (depending on the namespace mode (strong consistency or not), the client's policy and the transaction type) send the read/write request to a random node, which then proxies it to the correct node in order to fulfill the request. The usual culprits are one of the following: AWS Security Group issues, ACL issues, server firewall, SELinux misconfigurations, IPTables configuration, or network bandwidth limits. This issue often occurs when a node newly added to a cluster has not been properly configured.

2) Connection issue on the fabric port (default 3001) between peer nodes.

Such a connectivity issue would cause, for example, write transactions with the default policy to fail, causing the client to retry. Depending on the namespace mode (strong consistency or not) and the client's retry policy, this could lead to the client retrying the transaction against a random node. Persistent proxies on a stable cluster can therefore indicate that port 3001 is blocked for incoming fabric connections between nodes. Check the fabric port 3001; the same culprits described above apply. The nc tool can be used to check both port 3000 (client→server) and port 3001 (server→server).

Steps to test port connectivity:

Install nc tool:

sudo yum install nc

Command:

nc -z -v -w5 <IP address> <port>

Instead of nc, recent versions of bash can also test connectivity with the built-in /dev/tcp:

 timeout 1 bash -c 'cat < /dev/null > /dev/tcp/<IP address>/<port>';echo $?

Example test scripts:

for i in 192.168.1.{206,207,208,209,211,212,213,214,215,216,217,218,227,254,255};do nc -z -v -w5 $i 3000; done

nc: connect to 192.168.1.206 port 3000 (tcp) timed out: Operation now in progress
nc: connect to 192.168.1.207 port 3000 (tcp) succeeded!
nc: connect to 192.168.1.208 port 3000 (tcp) timed out: Operation now in progress
nc: connect to 192.168.1.209 port 3000 (tcp) succeeded!
nc: connect to 192.168.1.211 port 3000 (tcp) failed: No route to host
nc: connect to 192.168.1.212 port 3000 (tcp) succeeded!


[root@aero1 ~]# timeout 1 bash -c 'cat < /dev/null > /dev/tcp/192.168.120.171/3000';echo $?
0

[root@aero1 ~]# timeout 1 bash -c 'cat < /dev/null > /dev/tcp/192.168.120.171/3001';echo $?
0

[root@aero1 ~]# timeout 1 bash -c 'cat < /dev/null > /dev/tcp/192.168.120.171/3005';echo $?
bash: connect: Connection refused
bash: /dev/tcp/192.168.120.171/3005: Connection refused
1
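The per-node checks above can be combined into a single pass over both ports, using the same /dev/tcp trick. A sketch; the node IPs are placeholders for your cluster's addresses:

```shell
# Return 0 if a TCP connect to $1:$2 succeeds within one second, using
# bash's built-in /dev/tcp redirection.
port_open() {
    timeout 1 bash -c "cat < /dev/null > /dev/tcp/$1/$2" 2>/dev/null
}

# Check the service (3000) and fabric (3001) ports on every node.
# Node IPs below are placeholders; substitute your cluster's addresses.
for ip in 192.168.120.171 192.168.120.172; do
    for port in 3000 3001; do
        if port_open "$ip" "$port"; then
            echo "$ip:$port open"
        else
            echo "$ip:$port CLOSED"
        fi
    done
done
```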

3) High rate of client restarts (older clients).

The following only applies for older client versions (released prior to early 2017).

Are any clients crashing and being restarted?

Each client restart triggers a new partition map sync. This operation is performed at client startup, and the client also checks at regular intervals whether the cluster state has changed, re-syncing when it has. Until the client library has fully synced with the cluster's partition map, proxies could be seen (again, depending on the namespace configuration and the client's policy).

This will also be the case when the client application is restarted frequently, or if the client application isn't long lived. Every time the client application starts, it performs a partition map sync, and if read/write requests are executed before the client is fully synced, they can be sent to the wrong node, which then has to proxy them.

Check your clients' error logs for any restarts and crashes.

4) A monitoring client or rogue development client hitting production servers.

An in-depth analysis of the IP addresses hitting the servers may reveal the offending client.

Command:

netstat -pant | egrep '3000|3001'
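Rather than eyeballing the raw netstat output, the established connections can be tallied per remote IP. A sketch, assuming the standard Linux netstat column layout (local address in column 4, foreign address in column 5, state in column 6):

```shell
# Count established connections to the service port (3000) per remote IP.
# count_clients reads `netstat -pant`-style output on stdin; an unexpected
# IP with a high connection count may point at the offending client.
count_clients() {
    awk '$4 ~ /:3000$/ && $6 == "ESTABLISHED" {
        split($5, a, ":")   # a[1] is the remote IP (IPv4)
        count[a[1]]++
    } END {
        for (ip in count) print count[ip], ip
    }' | sort -rn
}

# Usage against the live socket table:
#   netstat -pant 2>/dev/null | count_clients
```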

Conclusion

Proxies can be caused by various connectivity issues between client and server, or between master and replica nodes within the same cluster. A thorough analysis and test of network connectivity over the service port (3000) and the fabric port (3001) may be required to track down the root cause. The rate of requests being proxied may also provide a clue as to the culprit.
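As a rough way to quantify that rate, sample client_proxy_complete twice and divide the delta by the sampling interval. A sketch; the commented asinfo pipeline for obtaining the samples is an assumption about your tooling:

```shell
# Approximate proxies per second from two samples of client_proxy_complete.
proxy_rate() {
    # $1: first sample, $2: second sample, $3: interval in seconds
    echo $(( ($2 - $1) / $3 ))
}

# e.g. take the two counter values ~10 seconds apart with:
#   asinfo -v 'statistics' | tr ';' '\n' | grep '^client_proxy_complete'
```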

Notes

This article assumes that the cluster is stable and that migrations are not occurring while proxies are seen. The impact of a proxy is limited to the requests being proxied; requests that are not proxied retain their lower latency. The more requests are proxied, the more the overall request latency on the cluster is impacted.

Keywords

Proxies, Network, Latency, proxy

Timestamp

02/14/2018