We’re updating a large Aerospike cluster from 3.8.4 to 3.9.1 and migrations after each server restart take several days to finish. I notice that Replica Objects on the restarted server drop to 0 just after restart, while Master Objects remain what they were before. Is there a way to stop Replicas from being invalidated?
could you please provide logs:
do you find any warnings in the logs:
No special warnings. But are you saying that this is not standard behavior?
Good question. Please see this KB article:
Thanks Tony. However, here’s the point I’m trying to get at:
If Node A has 10^9 Master Objects, when you restart Node A it sees the Master Objects on disk and loads them.
However, if Node A has 20^9 Replica Objects, when you restart Node A those 20^9 Replica Objects are all deleted from node A and re-copied from the rest of the network in migrations.
This is suboptimal (and takes several days on our cluster). What it should do is to load the Replicas from disk like it loads Master Objects. Is there a way to force this, or will it require a code change?
What version of Aerospike are you using? Are you using Community or Enterprise version? Note Enterprise version will use persistent share memory and will re-balance faster than community version:
We’re using community 3.9.1.1. However it appears that this is the problem: It looks like Node restarting make all partitions migrate
Presently, any cluster change will cause migration of all the partitions. We do not trust the sanity of existing partitions and verifying the same will be similar overhead.
From the above discussion, the behavior appears not to be version specific and has existed for a long time. Our cluster has about 5G objects and 10G replica objects, so the above doesn’t scale very well to our use case.
Hey Joel,
As mentioned in the Rapid Rebalance engineering blog post that Tony shared, Community Edition has a brute force algorithm that streams entire partitions when a cluster changes its size, while Enterprise Edition will only migrate the delta for a partition that is changing its node. This enables rolling upgrades of a cluster to finish significantly faster, in the neighborhood of a 40x speed improvement.
Essentially this problem has been solved, and is an enterprise feature.
Would like to clear a few misconceptions present in this thread:
The replica objects are still there, you should notice that your used disk space didn’t drop by half. The partition’s objects just aren’t counted during certain state transitions to prevent over counts (by arbitrary decision it was decided to be better to under count records during migrations than to over count).
This KB article is a bit dated - the rec_refs stat no longer exists and was also described incorrectly in article.
Replicas are loaded from disk.
This is really old information, this behavior was change with 3.5.8 release (migrations were much worse before ;)).
I think that we might be seeing a regression in 3.9.1.1 then. Right now here is what is happening:
-
After upgrade to 3.9.1.1, we restart a node.
-
30 minutes later it rejoins the cluster, having read through all of its master objects.
-
Migrations go at full throttle for several days. (48 hours at 8 threads).
If replica objects really are being read from disk, shouldn’t this process take an hour or two at most? Something more congruent to the node restart time – the master objects only take 30 minutes to be read, after all.
- Is this longer than usual if so what are the prior versions you have upgraded from?
- Are there any warnings in the log?
- Could you provide the output of:
asadm -e "show stat like migrate"
-
Yes. We are upgrading from 3.8.4 to 3.9.1.1. At default migrate settings (1 thread, 1ms sleep), migrations went from 24 hours (on 3.8.4) to 1 week (on 3.9.1.1) after the upgrade of a single node.
-
No warnings in the logs.
- Has your transaction throughput increased since 3.8.4? Transactions are given priority over migrations.
- Has the amount of objects or size of objects increased significantly?
Actually it appears you are running 3 threads and 1 us sleep. Assuming the rate of migrations actually is down, you could try increasing the number of migrate-threads further.
asadm -e "asinfo -v 'set-config:context=service;migrate-threads=N'"
@kporter Sorry that I wasn’t clear. The 1 thread 1 ms sleep numbers was the timing for the first node of the cluster that we upgraded. Since then we have been using 8 threads for upgrades (see the first comment). But you wanted comparison numbers against 3.8.4, and we never used 8 threads for 3.8.4. So I gave you 1-thread:1ms-sleep timings for comparison.
When I made that log where you see 3 threads, I was in the process of bumping threads from 1 to 8 – we do it slowly to avoid latency spikes. If we upgrade up a node when the cluster is at 8 migration threads, latency explodes. (It appears that threads slow down over time?)
This is our network graph during migrations today. The purple line is the upgraded server. Notice how network used gradually climbs (I was raising threads all during this process) and eventually craters. We’re at 18% migration complete right about now. Maybe this is normal functioning, but I wonder if threads are getting blocked. And there does seem to be a lot of network activity if replicas are not being passed over the network.
Below is our migration completion rate (by percentage done) over the same 4 hour time period.
For partitions that diverged (node left and returned) replicas are passed over the network. Individually the nodes do not know which has the latest state so the non-primary partitions migrate all records to the primary, on completion the primary migrates all records back.
The Enterprise version uses a newer algorithm which is able to avoid these migrations by streaming record meta data from the immigrator node to the emigrator node allowing the emigrator to skip most of the disk reads and network sends in these situations (reducing typical migration time to completion by ~40x).
I am concerned about ~7x migration performance reduction you are observed between 3.8.4 are 3.9.1 as it wasn’t noticed by our internal tests and this being the first report. Any insights into if/how your data has grown would be very useful for reproduction attempts.
-
Transaction throughput has not changed appreciably
-
We had 4.387G objects, avg size 335 bytes, and now have 4.596G objects, avg size 386 bytes. The avg object size increase is due to some new fields (we have only two types of objects).
Attempted to create a small model of this scenario and report the total duration of migration.
In this test I used 350B x 200M objects in a 2 node cluster with data on SSDs only with only a single migration thread while also maintaining 250K tps 80/20 read/write ratio. In this test one of the 2 nodes would restart, with a 2 minute delay between stopping and starting, forcing all 4096 partitions to resync.
- Aerospike Community 3.8.4 49059s (13:37:39)
- Aerospike Community 3.9.1.1 49408s (13:43:28)
- Aerospike Community 3.10.0 51015s (14:10:15)
- Aerospike Enterprise 3.10.0 1383s (00:23:03)