I’m currently upgrading the cluster to the new version (also rebooting each node to upgrade the kernel, etc.) and adding a new SSD to the config. The cluster is made of 7 nodes, with 400 GB of data on each node. Replication factor is 2.
It takes about 1 hour for asd to start, which is not an issue, but it takes roughly 16 hours for migrations to finish, even with migrate threads increased to 20 and more.
My question is: do I need to wait until all migrations are done, or is it enough to wait for asd to start and then proceed with the next node without compromising data? The extra CPU load on the servers is not an issue.
Migrations are the way Aerospike brings your replication factor back up to where you configured it.
A replication factor of 2 means that you have 2 copies of every record.
If you have migrations in progress in a cluster/namespace with a replication factor of 2, it means that, until migrations are 100% done, you will only have 1 copy of some records.
So if you take a server down, some of your data will be on that server, and some of that data will only have 1 copy left in the cluster. This means that once that server goes down… that data goes with it.
If you have persistent storage, Aerospike will be able to recover those records from disk and bring them back into the cluster. There are some “gotchas” with relying on this, though, like zombie records (records that were deleted but come back to life on a cold restart, e.g. if a client sent a delete while the node holding that data was offline). And there’s always the chance the server simply doesn’t come back up after you’ve done something to it.
To be on the safe side, let migrations finish.
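If it helps, you can watch the migration stats to know when they’re actually done (exact stat names vary a bit across server versions, so treat this as a sketch):

```
# cluster-wide view of migration progress (asadm ships with the tools package)
asadm -e "show statistics like migrate"

# or ask a single node directly; the migrate counters should drain to 0
asinfo -v "statistics" | tr ';' '\n' | grep -i migrate
```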
16 hours is a while though… maybe we could help identify the bottleneck?
It is as I suspected, but I wanted to be sure: if I reboot the next node while the previous one hasn’t synced yet, then part of the previous node’s data won’t be online.
It takes 12 to 16 hours, depending on how much I squeeze the machines. Specs of the nodes: bare metal, 2x 12-core CPU, 128 GB RAM, 5x 480 GB SSD (3 drives are used for Aerospike, the other two by the system). I’m using 24 migrate threads, which loads the CPU to approx. 85%, puts the system load around 60, and produces approximately 700 Mbit/s of traffic. The cluster is made of 7 nodes and holds 3.2 TB of data.
How long should migrations take to complete, by your estimate?
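For a rough sense of scale (back-of-envelope only, ignoring fan-in limits and per-record overhead):

```
700 Mbit/s ≈ 87.5 MB/s of migration traffic per node
400 GB per node ÷ 87.5 MB/s ≈ 4,600 s ≈ 1.3 h of raw transfer
```

So 12 to 16 hours suggests the bottleneck isn’t raw network bandwidth.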
You can also tune migrate-max-num-incoming. The default is 4; this limits the number of concurrent immigrations allowed into a node. Be careful not to raise it too high, especially with 24 migrate threads, as it can overwhelm the system (to the point that migrations get even slower). In the next release the max value for this config is 64.
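As a sketch of where these knobs live (values are illustrative, not recommendations):

```
# aerospike.conf, service context
service {
    migrate-threads          8   # you were running 24; consider stepping down
    migrate-max-num-incoming 8   # default 4; raise gradually and watch load
}
```

If I remember right, both can also be changed at runtime without a restart:

```
asinfo -v "set-config:context=service;migrate-threads=8"
asinfo -v "set-config:context=service;migrate-max-num-incoming=8"
```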
Also, Aerospike Enterprise has fast restart and rapid rebalance. These allow a node to restart and fully migrate data in minutes rather than hours.
Correct, though fast restart doesn’t bring back deleted records (only a cold start does), and rapid rebalance will be much faster even with secondary indexes. Also, if deleted records reappearing is an issue, Enterprise has durable deletes as well.