Cluster upgrade

matijav · May 28, 2017, 10:55am

Hi,

I’m currently upgrading the cluster to the new version (also rebooting the node to upgrade the kernel etc) and adding a new ssd to the config. The cluster is made of 7 nodes, 400GB data on each node. Replication factor is 2.

It takes 1 hour for Aerospike to start, which is not an issue, but it takes roughly 16 hours for migrations to finish, even with migration threads increased to 20 or more.

My question is: do I need to wait until all migrations are done, or is it enough to wait until Aerospike has started and then proceed with the next node, without compromising data? The extra CPU load on the servers is not an issue.

Thanks, Matija

Albot · May 29, 2017, 2:38am

Migrations are how Aerospike brings your replication factor back up to where you configured it. A replication factor of 2 means that you have 2 copies of every record.

If you have migrations in a cluster/namespace that has a replication factor of 2, this means that, until they are 100% done with migrations, you will only have 1 copy of some records.

So if you take another server down while migrations are still running, some of the records on that server may be the only copy in the cluster. Once that server goes down, that data goes with it.
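To make the point above concrete, here is a toy model in Python. It assumes a made-up placement rule (master on node `p % n`, replica on the next node), which is not Aerospike's real partition assignment, and it simply treats a node whose copies are not yet usable as "down". The only figures taken from this thread are the 7 nodes and replication factor 2; 4096 is the number of partitions Aerospike uses per namespace.

```python
from collections import Counter

NUM_PARTITIONS = 4096  # partitions per Aerospike namespace
NUM_NODES = 7          # cluster size from the original post


def toy_replica_map(num_partitions, num_nodes):
    """Hypothetical placement: master on p % n, replica on the next node."""
    return {p: (p % num_nodes, (p + 1) % num_nodes) for p in range(num_partitions)}


def live_copy_histogram(replica_map, down_nodes):
    """Count partitions by how many of their two copies sit on nodes that are up."""
    hist = Counter()
    for holders in replica_map.values():
        live = sum(1 for node in holders if node not in down_nodes)
        hist[live] += 1
    return hist


replica_map = toy_replica_map(NUM_PARTITIONS, NUM_NODES)
one_down = live_copy_histogram(replica_map, {0})
two_down = live_copy_histogram(replica_map, {0, 1})
print("node 0 down:", dict(one_down))
print("nodes 0 and 1 down:", dict(two_down))
```

With one node out, every partition still has at least one live copy; with a second node out at the same time, the partitions whose two copies lived on exactly those two nodes have none. The exact counts depend entirely on the assumed placement rule.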

If you have persistent storage, Aerospike will be able to recover these records from the disk and bring them back into the cluster. There are some “gotchas” relying on this though, like zombie records (records that were deleted but come back to life upon a cold restart, or if a client sent a delete event while the node holding that data was offline). Then there’s the potential that the server might actually just not come back up after you’ve done something to it.

To be on the safe side, let migrations finish. 16 hours is a while though… maybe we could help identify the bottleneck?
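A rolling upgrade that follows this advice boils down to "restart a node, wait until nothing is left to migrate, move on". A minimal Python sketch of the waiting step follows. The `get_remaining` callable is hypothetical: it stands in for however you read the number of pending migrations from your cluster, and the poll interval and timeout are placeholders, not recommendations.

```python
import time


def wait_for_migrations(get_remaining, poll_seconds=30.0,
                        timeout_seconds=None, sleep=time.sleep):
    """Poll until get_remaining() returns 0; return the number of polls made.

    get_remaining is a caller-supplied (hypothetical) function that returns
    how many migrations are still pending. Raises TimeoutError if the count
    is still non-zero once timeout_seconds of waiting has accumulated.
    """
    waited = 0.0
    polls = 0
    while True:
        polls += 1
        remaining = get_remaining()
        if remaining == 0:
            return polls
        if timeout_seconds is not None and waited >= timeout_seconds:
            raise TimeoutError(
                f"{remaining} migrations still pending after {waited:.0f}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with fake readings instead of a real cluster.
fake_readings = iter([120, 40, 0])
polls = wait_for_migrations(lambda: next(fake_readings),
                            poll_seconds=0, sleep=lambda s: None)
print("polls needed:", polls)
```

Only once this returns would the script proceed to restart the next node.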

matijav · May 29, 2017, 8:27am

Hi Albert,

Thanks for the response.

It is as I suspected, but I wanted to be sure. If I reboot the next node while the previous node hasn’t finished syncing, some part of the previous node’s data won’t be online.

It takes 12 to 16 hours, depending on how hard I squeeze the machines. Specs of the nodes: bare metal, 2x 12-core CPU, 128GB memory, 5x 480GB SSD (3 drives are used for AS, the other two are used by the system). I’m using 24 migration threads, which loads the CPU to approx. 85%, gives a system load of around 60 and produces approximately 700 Mbit/s of traffic. The cluster is made of 7 nodes and it has 3.2 TB of data in it.

How long should the migrations take to complete, by your estimate?
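For a rough sense of scale, the figures above can be put through plain bandwidth arithmetic. The Python sketch below assumes decimal units (1 GB = 10^9 bytes) and a constant 700 Mbit/s; it ignores protocol overhead, disk speed and any throttling, and the post does not say whether the 700 Mbit/s is per node or cluster-wide, so treat the results as order-of-magnitude only.

```python
GB = 10**9
TB = 10**12
RATE_BITS_PER_SECOND = 700 * 10**6  # 700 Mbit/s, as reported above


def transfer_hours(data_bytes, bits_per_second):
    """Hours needed to stream data_bytes at a constant bit rate."""
    return data_bytes * 8 / bits_per_second / 3600


def bytes_moved(bits_per_second, hours):
    """Bytes transferred in the given number of hours at a constant bit rate."""
    return bits_per_second / 8 * hours * 3600


one_node_hours = transfer_hours(400 * GB, RATE_BITS_PER_SECOND)
print(f"400 GB at 700 Mbit/s: {one_node_hours:.2f} h")

for hours in (12, 16):
    moved_tb = bytes_moved(RATE_BITS_PER_SECOND, hours) / TB
    print(f"{hours} h at 700 Mbit/s: {moved_tb:.2f} TB")
```

Comparing these numbers with the 400GB per node and 3.2 TB total gives a feel for how much of the migration time is explained by raw data volume alone.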

kporter · May 29, 2017, 5:13pm

You can also tune migrate-max-num-incoming. The default is 4; it limits the number of incoming migrations (immigrations) allowed to a node at once. Be careful not to raise it too high, especially with 24 migrate threads, as it can overwhelm the system (to the point where migrations are even slower). In the next release the max value for this config is 64.
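As a sketch of how such a change could be applied at runtime, the Python helper below only builds the `asinfo` command line and prints it; it does not run anything. It assumes that max-num-incoming refers to the service-context parameter `migrate-max-num-incoming` and that it can be set dynamically with `set-config`; check both against the configuration reference for your server version. The value 8 is purely illustrative.

```python
import shlex


def set_service_config_command(name, value):
    """Build an asinfo command that sets one service-context parameter."""
    info = f"set-config:context=service;{name}={value}"
    # Quote the info string so the shell does not split it at ';'.
    return "asinfo -v " + shlex.quote(info)


incoming_cmd = set_service_config_command("migrate-max-num-incoming", 8)
threads_cmd = set_service_config_command("migrate-threads", 8)
print(incoming_cmd)
print(threads_cmd)
```

A dynamic change like this applies only to the node that receives the command, so it would have to be repeated on each node and mirrored in the config file to survive a restart.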

Also, Aerospike Enterprise has fast restart and rapid rebalance. These allow a node to restart and fully migrate data in minutes rather than hours.

Albot · May 29, 2017, 6:06pm

Assuming no secondary indexes, and you are tombstoning your data or not sensitive to zombie records*

kporter · May 29, 2017, 6:47pm

Correct, though fast restart doesn’t bring back deleted records (only a cold start does), and rapid rebalance will be much faster, even with secondary indexes. Also, if deleted records reappearing is an issue, Enterprise has durable deletes.

matijav · May 29, 2017, 6:49pm

Thanks for the answers, I’ll try that. We don’t have secondary indexes and we don’t do deletes.

What kind of money are we talking here for the enterprise edition?

Albot · May 29, 2017, 8:37pm

AFAIK it’s based on the amount of unique data you’re storing. You’ll need to reach out to sales for a quote here: Contact Us | Aerospike
