Understanding Aerospike server proxies

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Understanding Aerospike server proxies

Abstract

A read or write request from a client that ends up on a cluster node that does not hold the partition the transaction is addressing will be proxied to the right cluster node. Proxies are expected while migrations are occurring, specifically when partition ownership changes but the client has not yet received the updated partition map. The focus of this article is to help troubleshoot proxies on a stable cluster with no migrations.

What gets proxied?

Depending on the namespace mode (strong consistency or not), the client’s policy settings and the transaction type, a transaction may land on a node that does not own the relevant partition, forcing that node to proxy the request to the node that does.
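
Whether a namespace runs in strong consistency mode can be confirmed directly on a node. A minimal check, assuming a namespace named test (adjust to your own) and a server version that exposes the strong-consistency configuration parameter:

asinfo -v "get-config:context=namespace;id=test" | tr ';' '\n' | grep strong-consistency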

Statistics/Logs to monitor

client_proxy_complete - Number of proxy transactions the node has initiated (because it did not hold a copy of the partition required to process the transaction) that completed successfully.
client_proxy_error - Number of proxy transactions initiated on the node that failed with an error.
client_proxy_timeout - Number of proxy transactions initiated on the node that timed out.
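
These statistics can be pulled ad hoc with asinfo, or through your regular monitoring pipeline. A minimal sketch (output format may vary slightly between server versions):

asinfo -v "statistics" | tr ';' '\n' | grep client_proxy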

From the server logs, you can monitor the ‘proxy’ values.

{ns_name} client: tsvc (0,0) proxy (0,0,0) read (126,0,1,3) write (2886,0,23) delete (197,0,1,19) udf (35,0,1) lang (26,7,0,3)

Refer to the log reference manual for further details.
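
To follow those values over time, the per-namespace client ticker lines can be filtered out of the server log. A rough sketch, assuming a namespace named test and the default log location /var/log/aerospike/aerospike.log:

grep "{test} client:" /var/log/aerospike/aerospike.log | tail -5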

What triggers proxies?

1) Connection issues on the service port (default 3000) between a client (XDR or a standard client) and a server node.

The client is unable to connect to a server node for a brief period (or longer) and may issue a partition map sync, thinking that a node has left the cluster or become unreachable. In such a case, the client may (depending on the namespace mode (strong consistency or not), the client’s policy and the transaction type) send the read/write request to a node that does not own the partition, which then proxies it to the correct node in order to fulfil the request. The usual culprits are AWS Security Group issues, ACL issues, server firewall rules, SELinux misconfigurations, iptables configuration or network bandwidth limits. This issue often occurs when a newly added node has not been properly configured.
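
On Linux hosts, a couple of quick checks can help rule out a local firewall or SELinux as the culprit. A minimal sketch (commands vary with the distribution and firewall in use):

# Look for firewall rules touching the Aerospike ports
sudo iptables -L -n | grep -E '3000|3001'

# Check whether SELinux is enforcing
getenforce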

2) Connection issues on the fabric port (default 3001) between peer nodes.

Such a connectivity issue would, for example, cause write transactions with the default policy to fail, prompting the client to retry. Depending on the namespace mode (strong consistency or not) and the client’s retry policy, the retry could be sent to a node holding a replica copy, which would then proxy the transaction. Proxies in this situation would indicate that port 3001 is blocked for incoming fabric connections between nodes. Check the fabric port 3001; the usual culprits are the same as described above. The nc tool can be used to check both port 3000 (client → server) and port 3001 (server → server).

Steps to test port connectivity:

Install the nc tool:

sudo yum install nc

Command:

nc -z -v -w5 <IP address> <port>

Instead of nc, bash’s built-in /dev/tcp pseudo-device can also be used:

 timeout 1 bash -c 'cat < /dev/null > /dev/tcp/<IP address>/<port>';echo $?

Example test scripts:

for i in 192.168.1.{206,207,208,209,211,212,213,214,215,216,217,218,227,254,255};do nc -z -v -w5 $i 3000; done

nc: connect to 192.168.1.206 port 3000 (tcp) timed out: Operation now in progress
nc: connect to 192.168.1.207 port 3000 (tcp) succeeded!
nc: connect to 192.168.1.208 port 3000 (tcp) timed out: Operation now in progress
nc: connect to 192.168.1.209 port 3000 (tcp) succeeded!
nc: connect to 192.168.1.211 port 3000 (tcp) failed: No route to host
nc: connect to 192.168.1.212 port 3000 (tcp) succeeded!


[root@aero1 ~]# timeout 1 bash -c 'cat < /dev/null > /dev/tcp/192.168.120.171/3000';echo $?
0

[root@aero1 ~]# timeout 1 bash -c 'cat < /dev/null > /dev/tcp/192.168.120.171/3001';echo $?
0

[root@aero1 ~]# timeout 1 bash -c 'cat < /dev/null > /dev/tcp/192.168.120.171/3005';echo $?
bash: connect: Connection refused
bash: /dev/tcp/192.168.120.171/3005: Connection refused
1

3) High rate of client restarts (older clients).

The following only applies for older client versions (released prior to early 2017).

Are any clients crashing and being restarted?

Each client restart triggers a new partition map sync. The client library performs this sync at startup and then checks at regular intervals whether the cluster state has changed, re-syncing when it has. Until the client library is fully in sync with the cluster’s partition map, proxies may be seen (again, depending on the namespace configuration and the client’s policy).

This also applies when the client application is restarted frequently or is not long lived: every time the application starts, it performs a partition map sync, and if read/write requests are issued before that sync completes, they may be sent to the wrong node, which then has to proxy them.

Check your client logs for any restarts or crashes. Watching connection churn on the server side can also help, as shown below.
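
A simple sketch to observe connection churn alongside the proxy counters, using the client_connections statistic (adjust the interval as needed):

watch -n 5 "asinfo -v 'statistics' | tr ';' '\n' | grep -E 'client_connections|client_proxy'"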

4) A monitoring client or a rogue development client hitting production servers.

An in-depth analysis of the IP addresses connecting to the servers may reveal the offending client.

Command:

netstat -pant | egrep '3000|3001'
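
To get a rough count of connections per remote IP on those ports (a sketch; column layout may differ slightly between netstat versions):

netstat -pant | egrep '3000|3001' | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn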

How to identify proxy clients?

In version 3.16.0.1, a new log context, proxy-divert, was introduced to capture the IP address of clients generating proxies on a server.

To identify clients that are triggering proxies, enable the proxy-divert log context at the detail log level.

asinfo -v "set-log:id=0;proxy-divert=detail"

Example:

$ asinfo -v "set-log:id=0;proxy-divert=detail"
ok
$ sudo journalctl -o cat -u aerospike -a -f
Aug 06 2019 00:30:53 GMT: DETAIL (proxy-divert): (proxy.c:229) {test} diverting from client 1.1.1.101:38422 to node bb902110a000064 <Digest>:0x4151de9f50abbeef960ed5313444d1d398433793
Aug 06 2019 00:30:53 GMT: DETAIL (proxy-divert): (proxy.c:229) {test} diverting from client 1.1.1.101:38422 to node bb902110a000064 <Digest>:0xf38d7b259abd4566c8f1e61630c13411a4acfdd3
Aug 06 2019 00:30:53 GMT: DETAIL (proxy-divert): (proxy.c:229) {test} diverting from client 1.1.1.101:38422 to node bb902110a000064 <Digest>:0xd6ea47795b0f8f4fffa76ae4a7fb10d82dbedbe7
Aug 06 2019 00:30:53 GMT: DETAIL (proxy-divert): (proxy.c:229) {test} diverting from client 1.1.1.101:38422 to node bb902110a000064 <Digest>:0xc4508f3952a89b8212f99c3620b9c7172440ca0a
Aug 06 2019 00:30:53 GMT: DETAIL (proxy-divert): (proxy.c:229) {test} diverting from client 1.1.1.101:38422 to node bb902110a000064 <Digest>:0x974bd82e8c446b1bccb1c6f1220973c6fd940871
Aug 06 2019 00:30:53 GMT: DETAIL (proxy-divert): (proxy.c:229) {test} diverting from client 1.1.1.101:38422 to node bb902110a000064 <Digest>:0x66468ffec04b006c9386b5024afb3a6b37c6ad36
Aug 06 2019 00:30:53 GMT: DETAIL (proxy-divert): (proxy.c:229) {test} diverting from client 1.1.1.101:38422 to node bb902110a000064 <Digest>:0x42bbcf5e1ce7ceb8319195fdd2a030483d67680a
Aug 06 2019 00:30:53 GMT: DETAIL (proxy-divert): (proxy.c:229) {test} diverting from client 1.1.1.101:38422 to node bb902110a000064 <Digest>:0xd10ce4dc16e4d69bc8dabbeabe44b8f3d4aac43b
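
The detail level on proxy-divert is verbose, so once troubleshooting is done, revert the context to its default level (typically info):

asinfo -v "set-log:id=0;proxy-divert=info"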

Conclusion

Proxies can occur due to various connectivity issues between clients and servers, or between master and replica nodes within the same cluster. A thorough analysis and test of network connectivity over the service port (3000) and fabric port (3001) may be required to track down the root cause. The rate of requests being proxied may also provide a clue as to the culprit.

Notes

This article assumes that the cluster is stable and that migrations are not occurring while proxies are being seen. The impact of proxies is limited to the requests being proxied; requests that are not proxied retain their normal latency. The more requests are proxied, the more the overall request latency on the cluster is impacted. The checks below can help confirm that the cluster is indeed stable.
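
A quick way to confirm cluster stability and the absence of pending migrations before drawing conclusions from the proxy counters (the expected cluster size of 3 is an example; adjust to your own):

asinfo -v "cluster-stable:size=3;ignore-migrations=no"
asadm -e "show statistics namespace like migrate"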

Keywords

Proxies, Network, Latency, proxy

Timestamp

August 2019