Client returns Max Retries reached when node is re-joined to the cluster

Hi all,

We are using Aerospike (6.0) to store data for one of our sub-domains in the company, and we have recently been seeing abnormal behaviour when we update the nodes of the cluster one by one.

The case is the following: we want to increase write-block-size from 128 to 256 for one of our clusters, so we are changing the configuration in a rolling manner, one node at a time, always waiting for migrations to complete before re-joining the updated node.

We are following the procedure described here:

The only addition is that we wait for all migrations to complete before continuing to the next node.
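One way to script the "wait for migrations" step is to poll each node's statistics and check that no partitions remain to be migrated. The sketch below assumes the statistics arrive as the semicolon-separated `key=value` string returned by the `statistics` info command, and that the `migrate_partitions_remaining` statistic is present; the node names and sample strings are hypothetical.

```python
def parse_info_pairs(response):
    """Turn an info response such as 'a=1;b=2' into a dict of strings."""
    pairs = (item.split("=", 1) for item in response.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def migrations_done(stats_by_node):
    """Return True only if every node reports zero partitions left to migrate.

    A node whose response lacks the statistic is treated as not done,
    which is the conservative choice for a rolling change.
    """
    for response in stats_by_node.values():
        remaining = parse_info_pairs(response).get("migrate_partitions_remaining")
        if remaining is None or int(remaining) != 0:
            return False
    return True


# Hypothetical sample responses, one per node.
sample = {
    "node-a": "cluster_size=5;migrate_partitions_remaining=0",
    "node-b": "cluster_size=5;migrate_partitions_remaining=12",
}
print(migrations_done(sample))
```

In practice the responses would come from an info call against every node in the cluster, repeated until the function returns True.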

However, what we noticed is that every time the node re-joins the cluster it has 0 master objects and the cluster starts “rebalancing” the data. During this time, our applications that work with this cluster intermittently receive MaxRetries reached errors for read operations.

In order to speed up the discussion I will pre-answer the questions asked here (Java client return timeout once one of the nodes is down) because I do see some similarities between the topics.

  1. All our policies are default.
  2. Cluster size is updated.

I find it hard to categorize the topic, as I am unaware of where exactly the problem might be (in the procedure, server, client, etc.), so I will leave it as General Discussion, but feel free to move it if you find a better place.

Hard to guess without access to the logs (client and server) but some quick thoughts:

  • Proxies… throughout migrations, when partition ownership changes, reads may have to be proxied from one node to another, adding to the latency of those specific transactions, which could result in timeouts on the client side (depending on the policies; the defaults depend on the client version, and they can be overridden on a per-transaction basis too).

  • If running in strong consistency mode, duplicate resolution would also add to the latencies.

The client-side error message should give some hints (the policies that were in effect and the server node that was last involved). The client type and version may also be important. These issues are usually fairly straightforward to get to the bottom of and tune accordingly by looking at client/server logs. If using the Enterprise Edition, open a Support case.

Thanks for reaching out, I have some answers to your questions.

  1. Unfortunately we are not using Enterprise Edition just yet.
  2. We are using the .NET client, version 5.3.0, with default policies (I know they can be overridden on a per-transaction basis; we are not doing this).
  3. We are not using strong consistency mode.
  4. Client and server logs showed nothing unusual. The server had only INFO-level logs, basically showing the progress of migrations and nothing else.

Another clue: today we continued with the upgrade, but this time we excluded the node, waited for migrations to complete, zeroed the disks, made the config change and re-joined the node. This way we didn’t experience any disturbances during reads while the node was joining. I don’t know if this will help, but it’s something we noticed.

The logs would have statistics updates showing the proxies, latencies, duplicate resolution and more. It is not about WARNING or ERROR log messages. Are your timeouts on the client pretty aggressive? Adding the node empty would change the profile of the migrations a bit but shouldn’t change that much overall… the logs would help check further which node(s) are having more latencies, at what time, whether the timeouts are from the client or the server, etc… (client logs also help).
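As a rough way to judge whether client timeouts are aggressive, the worst-case time a single read can spend before giving up can be estimated from the policy fields. The sketch below is a simplified model, not the client’s actual code: the parameter names mirror the client policy fields (socketTimeout, totalTimeout, maxRetries, sleepBetweenRetries), a value of 0 is taken to mean "no limit", and the example values are hypothetical rather than the defaults of any particular client version.

```python
def worst_case_ms(socket_timeout, max_retries, sleep_between_retries=0, total_timeout=0):
    """Estimate the longest a transaction can run before the client gives up.

    Simplified model: each attempt can block for up to socket_timeout,
    the initial attempt is not counted as a retry, and total_timeout
    (when non-zero) caps the whole transaction. Returns None if nothing
    bounds the transaction.
    """
    if socket_timeout == 0:
        # No per-attempt limit: only total_timeout can bound the transaction.
        return total_timeout if total_timeout > 0 else None
    attempts = max_retries + 1
    uncapped = attempts * socket_timeout + max_retries * sleep_between_retries
    if total_timeout > 0:
        return min(uncapped, total_timeout)
    return uncapped


# Hypothetical policy values, in milliseconds.
print(worst_case_ms(socket_timeout=1000, max_retries=2))
print(worst_case_ms(socket_timeout=1000, max_retries=2,
                    sleep_between_retries=100, total_timeout=2500))
```

If the estimate is close to the read latencies seen while proxies or migrations are active, retries can be exhausted even though no single node looks unhealthy.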

I managed to get server logs from the problematic period, they can be found here if they will help:

We have some record-too-big errors but they are known errors.

So, there don’t seem to be any read timeouts on this node (the third number under read, as per the log reference manual). No proxy either during that time:

Mar 27 2023 11:56:52 GMT: INFO (info): (ticker.c:616) {Settlement} client: tsvc (0,0) proxy (3,0,0) read (237094107,0,0,13454323,0) write (177505333,1334,1,0) delete (0,0,0,0,0) udf (0,0,0,0) lang (0,0,0,0)
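To check the same counters across many ticker lines or across all nodes, the line can be parsed mechanically. This is a minimal sketch that assumes the ticker format shown above; `parse_client_ticker` is a hypothetical helper name, and only the meaning of the third read counter (timeouts) is taken from this thread.

```python
import re

TICKER_LINE = (
    "Mar 27 2023 11:56:52 GMT: INFO (info): (ticker.c:616) {Settlement} client: "
    "tsvc (0,0) proxy (3,0,0) read (237094107,0,0,13454323,0) "
    "write (177505333,1334,1,0) delete (0,0,0,0,0) udf (0,0,0,0) lang (0,0,0,0)"
)


def parse_client_ticker(line):
    """Return (namespace, {operation: tuple of counters}) for a client ticker line."""
    match = re.search(r"\{(?P<ns>[^}]+)\} client: (?P<rest>.*)", line)
    if match is None:
        raise ValueError("not a client ticker line")
    # Each operation looks like 'read (1,2,3)': a name followed by integer counters.
    counters = {
        op: tuple(int(n) for n in nums.split(","))
        for op, nums in re.findall(r"(\w+) \(([\d,]+)\)", match.group("rest"))
    }
    return match.group("ns"), counters


namespace, counters = parse_client_ticker(TICKER_LINE)
read_timeouts = counters["read"][2]  # third number under read
print(namespace, read_timeouts, counters["proxy"])
```

Running this over the ticker lines of every node, and comparing consecutive lines, would show whether any server-side read timeouts or proxies accumulated during the problematic window.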

So there must be client timeouts or server timeouts from other nodes in the cluster. The client log message that comes with the timeout / max retries exceeded may confirm that. Do you have that? The network is, as expected, more used during migrations and I don’t know whether that could be impacting the latencies on the instance types being used and based on the client policies (default as you mentioned). Here is a graph from the log you shared showing the network usage for the different fabric channels:

This is the relevant migrations status during that time:

This is something Aerospike Support is well equipped to dig further into (given client logs and full cluster logs), so we may not be able to get into much more detail here and may have to leave it at guesses. You can always tune migrations and see whether that helps (Migrations | Aerospike Documentation). You did indicate that bringing a node in empty did help… that does change the nature of the migrations to some extent… Looking at the latency histograms on the node you shared logs for, there is no indication of anything slowing down much… but again, this is only one node out of five, so not a full picture:

Hi there,

Sorry for the late reply. For now, the best I can do is provide logs for the aforementioned interval for all nodes in the cluster. If we still don’t see anything, I will look for further details.

Unfortunately, I am afraid client logs will be unavailable, but let me know if I can help with anything else.

Client logs would be the best to check, though…