Client returns Max Retries reached when node is re-joined to the cluster

kuskmen · March 27, 2023, 3:47pm

Hi all,

We use Aerospike (6.0) to store data for one of our company’s sub-domains, and we have recently been seeing abnormal behaviour when we update the nodes of the cluster one by one.

The case is the following: we want to increase write-block-size from 128 to 256 for one of our clusters, so we are changing the configuration in a rolling manner, one node at a time, always waiting for migrations to finish before re-joining the updated node.

We are following the procedure described here: https://support.aerospike.com/s/article/How-do-I-change-the-write-block-size-configuration

In addition, before moving on to the next node we wait for all migrations to complete.
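For reference, the "wait for migrations" check can be approximated from the server's ticker log. The sketch below is only an illustration: it assumes the per-namespace `migrations:` line format seen in the 6.0 logs quoted later in this thread, and the function names `parse_migrations` and `migrations_done` are hypothetical. A line that does not match the pattern (for example, if the server logs the finished state in a different form) is treated as "unknown", not as "done".

```python
import re

# Matches the per-namespace migrations ticker line, for example:
# "{Settlement} migrations: remaining (144,581,216) active (2,4,0) complete-pct 66.07"
MIGRATIONS_RE = re.compile(
    r"\{(?P<ns>[^}]+)\} migrations: remaining \((?P<remaining>[\d,]+)\) "
    r"active \((?P<active>[\d,]+)\) complete-pct (?P<pct>[\d.]+)"
)


def parse_migrations(line):
    """Return (namespace, remaining, active, complete_pct), or None if no match."""
    m = MIGRATIONS_RE.search(line)
    if m is None:
        return None
    remaining = tuple(int(x) for x in m.group("remaining").split(","))
    active = tuple(int(x) for x in m.group("active").split(","))
    return m.group("ns"), remaining, active, float(m.group("pct"))


def migrations_done(line):
    """True only when a matching line shows nothing remaining and nothing active."""
    parsed = parse_migrations(line)
    if parsed is None:
        return False  # unknown format: do not assume migrations are finished
    _, remaining, active, _ = parsed
    return sum(remaining) == 0 and sum(active) == 0


sample = (
    "Mar 27 2023 11:20:03 GMT: INFO (info): (ticker.c:470) {Settlement} "
    "migrations: remaining (144,581,216) active (2,4,0) complete-pct 66.07"
)
print(parse_migrations(sample))  # ('Settlement', (144, 581, 216), (2, 4, 0), 66.07)
print(migrations_done(sample))   # False
```

Running this against the latest ticker line of every node before touching the next one would make the waiting step explicit rather than a judgment call.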

However, we noticed that every time the node re-joins the cluster it has 0 master objects and the cluster starts “rebalancing” the data. During this time, the applications that work with this cluster intermittently receive MaxRetries reached errors for read operations.

To speed up the discussion, I will pre-answer the questions asked in another topic (Java client return timeout once one of the nodes is down), because I do see some similarities between the two.

All our policies are default.
Cluster size is updated.

I find it hard to categorize the topic, as I don’t know where exactly the problem might be (in the procedure, the server, the client, etc.), so I will leave it under General Discussion, but feel free to move it if you find a better place.

meher · March 28, 2023, 8:28pm

Hard to guess without access to the logs (client and server) but some quick thoughts:

Proxies… throughout migrations, when a partition’s ownership changes, reads may have to be proxied from one node to another. That adds latency to those specific transactions, which could result in timeouts on the client side (depending on the policies; the defaults depend on the client version, and they can also be overridden on a per-transaction basis).
If running in strong consistency mode, duplicate resolution would also add to the latencies.

The client-side error message should give some hints (the policies that were in effect and the server node that was last involved). The client type and version may also be important. These issues are usually fairly straightforward to get to the bottom of, and to tune for, by looking at client/server logs. If using the Enterprise Edition, open a Support case.

kuskmen · March 28, 2023, 10:29pm

Thanks for reaching out, I have some answers to your questions.

Unfortunately we are not using Enterprise Edition just yet.
We are using the .NET client (version 5.3.0) with default policies (I know they can be overridden on a per-transaction basis; we are not doing this).
We are not using strong consistency mode.
Client and server logs showed nothing unusual. The server, for instance, had only INFO-level entries, basically showing the progress of migrations and nothing else.

Another clue: today we continued with the upgrade, but this time we excluded the node, waited for migrations to complete, zeroed the disks, made the config change and re-joined the node. This way we didn’t experience any disturbance to reads while the node was joining. I don’t know if this will help, but it’s something we noticed.

meher · March 29, 2023, 2:55am

The logs would have statistics updates showing the proxies, latencies, duplicate resolution and more. It is not about WARNING or ERROR log messages. Are your timeouts on the client pretty aggressive? Adding the node empty would change the profile of the migrations a bit but shouldn’t change that much overall… the logs would help check further which node(s) are seeing higher latencies, at what time, whether the timeouts are from the client or the server, etc… (client logs also help).

kuskmen · March 29, 2023, 4:42pm

I managed to get server logs from the problematic period. They can be found here, in case they help:

https://gist.github.com/kuskmen/3793841d6f4330be682d8256443dd3d5

log

Mar 27 2023 11:20:02 GMT: INFO (info): (ticker.c:162) NODE-ID bb9c865eb0a0142 CLUSTER-SIZE 5
Mar 27 2023 11:20:02 GMT: INFO (info): (ticker.c:243)    cluster-clock: skew-ms 0
Mar 27 2023 11:20:02 GMT: INFO (info): (ticker.c:264)    system: total-cpu-pct 241 user-cpu-pct 206 kernel-cpu-pct 35 free-mem-kbytes 8263276 free-mem-pct 33 thp-mem-kbytes 0
Mar 27 2023 11:20:02 GMT: INFO (info): (ticker.c:286)    process: cpu-pct 226 threads (13,76,49,30) heap-kbytes (12438053,15385584,16505344) heap-efficiency-pct 80.8
Mar 27 2023 11:20:02 GMT: INFO (info): (ticker.c:296)    in-progress: info-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0
Mar 27 2023 11:20:02 GMT: INFO (info): (ticker.c:319)    fds: proto (72,10035522,10035450) heartbeat (4,112,108) fabric (96,216,120)
Mar 27 2023 11:20:02 GMT: INFO (info): (ticker.c:328)    heartbeat-received: self 2 foreign 612966588
Mar 27 2023 11:20:02 GMT: INFO (info): (ticker.c:354)    fabric-bytes-per-second: bulk (13792131,9436036) ctrl (44,37) meta (0,0) rw (198971,244479)
Mar 27 2023 11:20:02 GMT: INFO (info): (ticker.c:404)    batch-index: batches (40524,0,0) delays 0
Mar 27 2023 11:20:02 GMT: INFO (info): (hist.c:320) histogram dump: batch-index (40524 total) msec

This file has been truncated.

We have some record-too-big errors but they are known errors.

meher · March 29, 2023, 6:15pm

So, there don’t seem to be any read timeouts on this node (timeouts are the third number under read, as per the log reference manual). No proxies either during that time:

Mar 27 2023 11:56:52 GMT: INFO (info): (ticker.c:616) {Settlement} client: tsvc (0,0) proxy (3,0,0) read (237094107,0,0,13454323,0) write (177505333,1334,1,0) delete (0,0,0,0,0) udf (0,0,0,0) lang (0,0,0,0)
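As a rough illustration of how that check can be repeated across all nodes, the sketch below pulls the counters out of a `client:` ticker line. It assumes the line format shown above; the helper name `parse_client_line` is hypothetical, and the only positional meaning it relies on is the one stated here (the third number under read is the timeout count). The other positions are left uninterpreted and should be checked against the log reference manual.

```python
import re

# "{namespace} client: tsvc (..) proxy (..) read (..) write (..) ..."
CLIENT_RE = re.compile(r"\{(?P<ns>[^}]+)\} client: (?P<rest>.*)")
# One "name (n,n,...)" group inside the rest of the line.
GROUP_RE = re.compile(r"(\w+) \(([\d,]+)\)")


def parse_client_line(line):
    """Return (namespace, {counter_name: tuple_of_ints}), or None if no match."""
    m = CLIENT_RE.search(line)
    if m is None:
        return None
    counters = {
        name: tuple(int(v) for v in values.split(","))
        for name, values in GROUP_RE.findall(m.group("rest"))
    }
    return m.group("ns"), counters


sample = (
    "Mar 27 2023 11:56:52 GMT: INFO (info): (ticker.c:616) {Settlement} client: "
    "tsvc (0,0) proxy (3,0,0) read (237094107,0,0,13454323,0) "
    "write (177505333,1334,1,0) delete (0,0,0,0,0) udf (0,0,0,0) lang (0,0,0,0)"
)
ns, counters = parse_client_line(sample)
read_timeouts = counters["read"][2]  # third number under read
print(ns, read_timeouts)  # Settlement 0
```

Comparing `read_timeouts` between the first and last ticker line of the problematic window on each node would show whether any node recorded read timeouts during that period.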

So there must be client timeouts, or server timeouts from other nodes in the cluster. The client log message that comes with the timeout / max retries exceeded error may confirm that. Do you have that? The network is, as expected, used more during migrations, and I don’t know whether that could be affecting latencies on the instance types being used, given the client policies (default, as you mentioned). Here is a graph from the log you shared showing the network usage for the different fabric channels:
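A minimal sketch of how per-channel numbers like these can be extracted from the `fabric-bytes-per-second` ticker lines follows. It assumes the line format seen in the shared logs; `parse_fabric` is a hypothetical helper, and the meaning of the two values in each pair is not interpreted here (check the log reference manual before labelling them).

```python
import re

FABRIC_RE = re.compile(r"fabric-bytes-per-second: (?P<rest>.*)")
# One "channel (a,b)" pair, e.g. "bulk (14745726,21686587)".
PAIR_RE = re.compile(r"(\w+) \((\d+),(\d+)\)")


def parse_fabric(line):
    """Return {channel: (first_value, second_value)}, or None if no match."""
    m = FABRIC_RE.search(line)
    if m is None:
        return None
    return {
        name: (int(a), int(b))
        for name, a, b in PAIR_RE.findall(m.group("rest"))
    }


sample = (
    "Mar 27 2023 11:20:03 GMT: INFO (info): (ticker.c:354)    "
    "fabric-bytes-per-second: bulk (14745726,21686587) ctrl (49,46) "
    "meta (0,0) rw (278240,246097)"
)
channels = parse_fabric(sample)
print(channels["bulk"])  # (14745726, 21686587)
print(channels["rw"])    # (278240, 246097)
```

Collecting these values per timestamp and per node gives a series that can be plotted against the migration window.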

This is the relevant migrations status during that time:

This is something Aerospike Support is well equipped to dig into further (given client logs and full cluster logs), so here we may not get much beyond guesses. You can always tune migrations and see whether that helps (Migrations | Aerospike Documentation). You did indicate that bringing a node in empty did help… that does change the nature of the migrations to some extent… Looking at the latency histograms on the node you shared logs for, there is no indication of anything slowing down much… but again, this is only one node out of five, so not a full picture:

kuskmen · April 3, 2023, 8:33am

Hi there,

Sorry for the late reply. For now, the best I can do is provide logs for the aforementioned interval for all nodes in the cluster. If we still don’t see anything, I will look for further details.

https://gist.github.com/kuskmen/adb870ed8341bfe66f86dac1ae022e59

aerospike-node-1.log

Mar 27 2023 11:20:03 GMT: INFO (info): (ticker.c:162) NODE-ID bb9c365eb0a0142 CLUSTER-SIZE 5
Mar 27 2023 11:20:03 GMT: INFO (info): (ticker.c:243)    cluster-clock: skew-ms 0
Mar 27 2023 11:20:03 GMT: INFO (info): (ticker.c:264)    system: total-cpu-pct 171 user-cpu-pct 130 kernel-cpu-pct 41 free-mem-kbytes 10650548 free-mem-pct 43 thp-mem-kbytes 2048
Mar 27 2023 11:20:03 GMT: INFO (info): (ticker.c:286)    process: cpu-pct 177 threads (11,64,39,30) heap-kbytes (13126452,13202936,14510592) heap-efficiency-pct 99.4
Mar 27 2023 11:20:03 GMT: INFO (info): (ticker.c:296)    in-progress: info-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0
Mar 27 2023 11:20:03 GMT: INFO (info): (ticker.c:319)    fds: proto (118,126710,126592) heartbeat (4,11,7) fabric (96,120,24)
Mar 27 2023 11:20:03 GMT: INFO (info): (ticker.c:328)    heartbeat-received: self 2 foreign 6885044
Mar 27 2023 11:20:03 GMT: INFO (info): (ticker.c:354)    fabric-bytes-per-second: bulk (14745726,21686587) ctrl (49,46) meta (0,0) rw (278240,246097)
Mar 27 2023 11:20:03 GMT: INFO (info): (ticker.c:413) {Settlement} objects: all 53209738 master 15771008 prole 34411759 non-replica 3026968
Mar 27 2023 11:20:03 GMT: INFO (info): (ticker.c:470) {Settlement} migrations: remaining (144,581,216) active (2,4,0) complete-pct 66.07

This file has been truncated.

aerospike-node-2.log

aerospike-node-4.log

There are more than three files.

Unfortunately, I am afraid client logs will be unavailable, but let me know if I can help with anything else.

meher · April 3, 2023, 10:43pm

Client logs would be the best to check, though…

