Hello,
We are on Aerospike server version 6.3.0.24 (EE) and using the Java client aerospike-client-jdk21 8.1.1. Consistency mode is AP, replication-factor is 2 and prefer-uniform-balance is true. My query policy settings are:
import com.aerospike.client.policy.QueryPolicy;

QueryPolicy queryPolicy = new QueryPolicy();
queryPolicy.maxConcurrentNodes = 1;   // query one node at a time
queryPolicy.recordQueueSize = 10000;  // records queued before the producer blocks
queryPolicy.socketTimeout = 370000;   // 370s
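For completeness, here is roughly how such a policy gets passed into a secondary-index query (a sketch only; the namespace, set and bin names are placeholders, and client stands for an AerospikeClient instance):

import com.aerospike.client.query.Filter;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

Statement stmt = new Statement();
stmt.setNamespace("test");                        // placeholder namespace
stmt.setSetName("demo");                          // placeholder set
stmt.setFilter(Filter.equal("status", "ACTIVE")); // placeholder secondary-index filter

RecordSet rs = client.query(queryPolicy, stmt);   // client: an AerospikeClient instance (assumed)
try {
    while (rs.next()) {
        // process rs.getRecord()
    }
} finally {
    rs.close();
}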
We have faced the following problem since we upgraded from version 4.5 (Community Edition) to 6.3 (EE):
- When a node goes down (due to a power outage or intermittent network problems), migrations are triggered to honour RF=2.
- When the node joins back, we see SI queries failing for the entire duration of migrations (i.e. until fill migrations complete and the cluster is stable again), which can sometimes be hours! We see the following exceptions on the client side:
com.aerospike.client.AerospikeException: Error 11,6,0,370000,0,5,BB9B6F2E0EFEC3C 10.15.21.244 3000: Partition 3952 unavailable sub-exceptions: Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 628 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 775 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 776 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 779 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1033 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1394 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1412 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2285 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2798 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 3151 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 3571 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 4019 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 4048 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 926 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1813 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2118 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2255 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2569 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2678 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1015 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1220 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1663 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2002 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 2223 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 3443 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 155 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 736 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1897 unavailable Error 11,1,BB976F2E0EFEC3C 10.15.20.50 3000: Partition 1965 unavailable
From Aerospike Support we have received the following two cases in which these errors can happen:
- When cluster membership is in flux due to a node dropping or joining: in these cases such errors are expected for a few seconds, until the client obtains the correct partition maps. Increasing maxRetries / sleepBetweenRetries would be the right approach here if we want to avoid failures while the cluster is in a transient state (a minimal sketch of these settings follows the quoted explanation below).
- When queries are unable to reserve partitions: this is the case where a working master partition is marked as subset, and as such, queries will not be able to “reserve the partition” until it is “full”. Below is the detailed explanation we got from Aerospike Support.
Here is the explanation for why queries would fail when a node is brought back with data in AP. This is apart from the short-duration failures which can happen in between the client tends and partition hand-offs.
Adding a node with data (or when a node re-joins the cluster) will cause the working master to become subset and will not allow queries to reserve partitions in that state (which would be any partitions the re-joining node owns). This only happens in AP, as we end up with different ‘families’ (the node that joins back could have been restarted and become a one-node cluster before re-joining the cluster). That may not always be the case, but we still do not allow queries to reserve partitions in that state.
For example, let’s say A, B, C & D are the nodes of the cluster. Let’s consider a partition P1.
P1 - A is master & B is replica, all good.
A goes down; no issue, as B has the full set of the partition.
A comes back into the cluster with data: the working master, which is B in this case, will be marked as subset and will not allow queries to reserve such a partition. In your case, since it’s a 12-node cluster, that is roughly 4096/12 ≈ 341 partitions on that node, plus the partitions on other nodes which are waiting for incoming migrations; since it is AP, there will be 2-way migrations.
The partition reservation will fail when there are pending migrations to the current master partition (in this case B), regardless of its full/subset state. Technically, in AP, when the node re-joins we would still have the acting master full, but the fact that it is expecting migrations from the node that joined back would cause the partition reservation to fail.
We also know that in AP mode, when a node comes back (with or without data), there could be 2 rounds of migrations for a partition, causing the client to see an ‘old’ master for queries and exhaust all the retries on that node (it would be claiming master and replica until the partition is fully migrated). The sleepBetweenRetries could help to allow the client to wait before retrying and allow for potentially a new tend to complete, so the next retry would likely be successful. This is for those short-duration holes where the partition ownership changes. Increasing sleepBetweenRetries / timeouts would only help for the ‘few errors’ when the node is quiesced, but not for the issue when the node joins back. This is by design for AP systems.
I hope this clears your doubts. The workaround, as I mentioned, is to use relax-ap-queries.
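For reference, this is roughly how the retry-related knobs mentioned above map onto the Java client's QueryPolicy (a minimal sketch; the values are illustrative, not tuned recommendations):

import com.aerospike.client.policy.QueryPolicy;

QueryPolicy retryTolerantPolicy = new QueryPolicy();
retryTolerantPolicy.maxConcurrentNodes = 1;     // keep our existing setting
retryTolerantPolicy.socketTimeout = 370000;     // keep our existing 370s socket timeout
retryTolerantPolicy.maxRetries = 5;             // illustrative: allow a few more attempts
retryTolerantPolicy.sleepBetweenRetries = 2000; // illustrative: wait 2s so the next tend can refresh the partition map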
In our case, case 2 above is the reason for queries failing for hours.
As suggested by Aerospike Support (and also mentioned in this article), the solution is to use relaxed queries (by choosing expectedDuration = SHORT / LONG_RELAX_AP on the client side).
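On the client side this is roughly what selecting a relaxed mode would look like (a minimal sketch, assuming a client version that exposes QueryPolicy.expectedDuration and the QueryDuration enum; LONG_RELAX_AP additionally needs server 7.1+):

import com.aerospike.client.policy.QueryPolicy;
import com.aerospike.client.query.QueryDuration;

QueryPolicy relaxedPolicy = new QueryPolicy();
// LONG is the non-relaxed default; SHORT is intended for queries returning roughly 100 records per node or fewer.
relaxedPolicy.expectedDuration = QueryDuration.LONG_RELAX_AP; // relaxed long query; requires server 7.1+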
My questions are:
A comes back into the cluster with data: the working master, which is B in this case, will be marked as subset and will not allow queries to reserve such a partition.
The partition reservation will fail when there are pending migrations to the current master partition (in this case B), regardless of its full/subset state. Technically, in AP, when the node re-joins we would still have the acting master full, but the fact that it is expecting migrations from the node that joined back would cause the partition reservation to fail.
- I fail to understand why the working master B will be marked as “subset” in the above case when it is essentially full? Does Aerospike have no way of knowing that the master partition is full? Is this a shortcoming of Aerospike, or a fundamental problem with distributed databases?
- Since we have replication-factor 2, I think it is reasonable to expect that we should be able to tolerate one node going down and re-joining the cluster after some time. Why is Aerospike not able to tolerate a single node failure?
- Why didn’t we observe such issues in version 4.5? What changed in version 6.3?
- If relaxed queries avoid such failures during migrations, why are relaxed queries not the default (the default is expectedDuration = LONG, which is not relaxed)? Surely avoiding failures during node removal and re-joining would be desirable for most users of Aerospike.
- In order to use expectedDuration = LONG_RELAX_AP we need to upgrade to at least server version 7.1. Is there some way of avoiding these query failures without having to upgrade the server? We cannot use expectedDuration = SHORT because our queries return a large number of records per node (as opposed to the recommended ~100 per node).
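For reference, this is roughly what the failure looks like programmatically, and how it could at least be detected explicitly (a sketch, assuming ResultCode.PARTITION_UNAVAILABLE is the client constant behind the "Error 11 / Partition unavailable" messages above; client and stmt are as in the sketch near the top of this post):

import com.aerospike.client.AerospikeException;
import com.aerospike.client.ResultCode;
import com.aerospike.client.query.RecordSet;

try {
    RecordSet rs = client.query(queryPolicy, stmt);
    try {
        while (rs.next()) {
            // process rs.getRecord()
        }
    } finally {
        rs.close();
    }
} catch (AerospikeException ae) {
    if (ae.getResultCode() == ResultCode.PARTITION_UNAVAILABLE) {
        // "Error 11": the query could not reserve one or more partitions while migrations are in flight.
        // This is the error we see for hours after a node re-joins.
    }
    throw ae;
}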