Partition Unavailable errors when a node joins back a cluster

Hello,

We are on Aerospike server version 6.3.0.24 (EE) and using Java client aerospike-client-jdk21 8.1.1. Consistency mode is AP, replication-factor is 2, and prefer-uniform-balance is true. My query policy is:

      QueryPolicy queryPolicy = new QueryPolicy();
      queryPolicy.maxConcurrentNodes = 1;
      queryPolicy.recordQueueSize = 10000;
      queryPolicy.socketTimeout = 370000; //370s

We have faced the following problem since we upgraded from version 4.5 (Community Edition) to 6.3 (EE):

  • When a node goes down (due to a power outage or intermittent network problems), migrations are triggered to honour RF=2.
  • When the node joins back, we see SI queries failing for the entire duration of migrations (i.e. until fill migrations complete and the cluster is stable again), which can sometimes be hours! We see the following exceptions on the client side:

      com.aerospike.client.AerospikeException: Error 11,6,0,370000,0,5,BB9B6F2E0EFEC3C 10.15.21.244 3000: Partition 3952 unavailable
      sub-exceptions:
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 628 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 775 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 776 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 779 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1033 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1394 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1412 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2285 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2798 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 3151 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 3571 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 4019 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 4048 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 926 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1813 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2118 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2255 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2569 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2678 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1015 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1220 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1663 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2002 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2223 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 3443 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 155 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 736 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1897 unavailable
      Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1965 unavailable

From Aerospike Support we have received the following two cases in which these errors can happen:

  1. When cluster membership is in flux because a node is dropping or joining: such errors are expected for a few seconds until the client obtains the correct partition map. Increasing maxRetries / sleepBetweenRetries is the right approach here if we want to avoid failures while the cluster is in a transient state (see the retry sketch after the quoted explanation below).
  2. When queries are unable to reserve partitions: this is the case where a working master partition is marked as subset, and queries cannot “reserve the partition” until it is “full” again. Below is the detailed explanation we got from Aerospike Support.

Here is the explanation for why queries would fail when a node is brought back with data in AP mode. This is apart from the short-duration failures which can happen between client tends and partition hand-offs.

Adding a node with data (or a node re-joining the cluster) will cause the working master to become subset, and queries will not be allowed to reserve partitions in that state (which would be any partitions the rejoining node owns). This only happens in AP because we can end up with different ‘families’ (the node that joins back could have been restarted and formed a one-node cluster before rejoining). That may not always be the case, but we still do not allow queries to reserve partitions in that state.

For example, let’s say A, B, C and D are the nodes of the cluster, and consider a partition P1.

P1 - A is master & B is replica, all good.

A goes down; no issue, as B has the full set of the partition.

A comes back into the cluster with data. The working master, which is B in this case, will be marked as subset and will not allow queries to reserve such a partition. In your case, since it’s a 12-node cluster, that will be roughly 4096/12 ≈ 342 partitions on that node, plus the partitions on other nodes that are waiting for incoming migrations, since in AP there will be 2-way migrations.

The partition reservation will fail when there are pending migrations to the current master partition (in this case, B), regardless of its full/subset state. Technically, in AP, when the node rejoins we would still have the acting master full, but the fact that it is expecting migrations from the node that joined back would cause the partition reservation to fail.

We also know that in AP mode, when a node comes back (with or without data), there could be 2 rounds of migrations for a partition, causing the client to see an ‘old’ master for queries and exhaust all the retries on that node (it would be claiming master and replica until the partition is fully migrated). The sleepBetweenRetries setting could help by letting the client wait before retrying, potentially allowing a new tend to complete, so the next retry would likely succeed. This is for those short-duration holes where the partition ownership changes.

Increasing the sleepBetweenRetries / timeouts would only help with the ‘few errors’ seen when the node is quiesced, not with the issue when the node joins back. This is by design for AP systems.

I hope this clears your doubts. The workaround, as I mentioned, is to use relax-ap-queries.
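For reference, case 1 above can be addressed purely with client-side retry tuning. A minimal sketch, using the standard maxRetries / sleepBetweenRetries fields on the Java client's QueryPolicy (the values are illustrative, not recommendations):

      import com.aerospike.client.policy.QueryPolicy;

      QueryPolicy queryPolicy = new QueryPolicy();
      queryPolicy.maxConcurrentNodes = 1;
      queryPolicy.recordQueueSize = 10000;
      queryPolicy.socketTimeout = 370000;     // 370s, as in our current policy
      queryPolicy.maxRetries = 5;             // retry while the client catches up with the new partition map
      queryPolicy.sleepBetweenRetries = 3000; // ms to wait so another tend can complete before the retry

As the support explanation notes, this only helps with the transient errors of case 1, not with case 2.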

In our case, case 2 above is the reason for queries failing for hours.

As suggested by Aerospike Support (and also mentioned in this article), the solution is to use relaxed queries (by choosing expectedDuration = SHORT / LONG_RELAX_AP on the client side).
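On the client side that change would look roughly as follows. A minimal sketch, assuming the QueryDuration enum and the QueryPolicy.expectedDuration field exposed by the 8.x Java client we are on:

      import com.aerospike.client.policy.QueryPolicy;
      import com.aerospike.client.query.QueryDuration;

      QueryPolicy queryPolicy = new QueryPolicy();
      // Relaxed long-running query: avoids the "Partition unavailable" failures
      // during migrations. LONG_RELAX_AP requires server 7.1+ (see question 5 below).
      queryPolicy.expectedDuration = QueryDuration.LONG_RELAX_AP;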

My questions (quoting the relevant parts of the support explanation above for context) are:

A comes back into the cluster with data. The working master, which is B in this case, will be marked as subset and will not allow queries to reserve such a partition.

The partition reservation will fail when there are pending migrations to the current master partition (in this case, B), regardless of its full/subset state. Technically, in AP, when the node rejoins we would still have the acting master full, but the fact that it is expecting migrations from the node that joined back would cause the partition reservation to fail.

  1. I fail to understand why the working master B will be marked as “subset” in the above case when it is essentially full. Does Aerospike have no way of knowing that the master partition is full? Is this a shortcoming of Aerospike or a fundamental problem with distributed databases?
  2. Since we have replication-factor = 2, I think it is reasonable to expect that we should be able to tolerate 1 node going down and re-joining the cluster after some time. Why is Aerospike not able to tolerate a single node failure?
  3. Why didn’t we observe such issues in version 4.5? What changed in version 6.3?
  4. If relaxed queries avoid such failures during migrations, why are relaxed queries not the default (the default is expectedDuration = LONG, which is not relaxed)? Surely avoiding failures during node removal and re-joining would be desirable for most users of Aerospike.
  5. In order to use expectedDuration = LONG_RELAX_AP we need to upgrade to at least server version 7.1. Is there some way to avoid these query failures without having to upgrade the server? We cannot use expectedDuration = SHORT because our queries return a large number of records per node (as opposed to the recommended ~100 per node).

Great question. Here is how I make sense of this (even if I may be simplifying a bit and may not have all of the exact nomenclature).

In AP mode, in case of a split brain, each partition will be available in different sub-clusters at the same time. So when the cluster re-forms, you will have 2-way migrations, even though each sub-cluster considered each partition full as far as it was concerned; basically, each side may have data the other side does not.

Applying this to a node restarting, which can be viewed as ‘a node joining a cluster with data’: as far as the cluster is concerned, in AP mode, that node may have data the cluster does not (for example, as unlikely as it is, that node could have formed a single-node cluster and taken some writes before going down, or after coming up and before joining). Therefore, you will have 2-way migrations when such a node with data joins the cluster. Even though the existing working master would be full (as you pointed out), it would be expecting incoming migrations from the node that joined, and that is what, by default, prevents the partition reservation for queries while there are pending incoming migrations. Individual or batched reads/writes/UDFs would duplicate resolve against the node that has the pending migrations, but queries do not support duplicate resolution at this point (nor do they support proxies).

The interesting thing is that this wouldn’t happen in strong consistency mode, since a node joining back with data would not need to migrate anything into the cluster, unless of course we are in a situation where the node joining back makes a partition ‘available’ again.

What changed between the versions you mentioned is the way queries are implemented, specifically SI queries. As of version 6.0, SI queries run on a per-partition basis, to allow ‘chasing a specific partition’ during cluster changes.
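To illustrate what "per-partition basis" means on the client side, here is a minimal sketch using the Java client's queryPartitions API; the host, namespace, set, bin and partition range below are made-up placeholders. The client tracks progress per partition, which is what lets it chase a partition whose ownership changes mid-query.

      import com.aerospike.client.AerospikeClient;
      import com.aerospike.client.policy.QueryPolicy;
      import com.aerospike.client.query.Filter;
      import com.aerospike.client.query.PartitionFilter;
      import com.aerospike.client.query.RecordSet;
      import com.aerospike.client.query.Statement;

      public class PartitionQuerySketch {
          public static void main(String[] args) {
              try (AerospikeClient client = new AerospikeClient("10.15.20.50", 3000)) {
                  Statement stmt = new Statement();
                  stmt.setNamespace("test");                   // placeholder namespace
                  stmt.setSetName("users");                    // placeholder set
                  stmt.setFilter(Filter.range("age", 18, 65)); // SI query predicate on a placeholder bin

                  QueryPolicy policy = new QueryPolicy();
                  policy.maxConcurrentNodes = 1;
                  policy.recordQueueSize = 10000;
                  policy.socketTimeout = 370000;

                  // Restrict the query to partitions 0..1023; each partition is
                  // queried and tracked individually by the client.
                  PartitionFilter partitions = PartitionFilter.range(0, 1024);
                  try (RecordSet rs = client.queryPartitions(policy, stmt, partitions)) {
                      while (rs.next()) {
                          // process rs.getKey() / rs.getRecord()
                      }
                  }
              }
          }
      }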