Non-ACID compliant upgrades

Joel_Eidsath · September 16, 2016, 6:12pm

Do not upgrade another node in your cluster until the Migrates column shows zeroes. Interrupting data migration may cause data loss.

If you are upgrading a large cluster, you may begin upgrading the next server after a delay of 30-60 seconds, without allowing the data migration to complete. This method may not be fully ACID compliant.

Does anyone have any experience with this? What parts of ACID are lost exactly? I assume that it’s only update propagation that is affected, with no data loss (for a large enough cluster).

Our migrations take 2-3 days after a server restart, so we are eyeballing this method.

kporter · October 1, 2016, 12:45am

When running replication-factor 2, new writes (updates or creates) can be lost when 2 or more nodes have individually been restarted since the last migration completed. This isn't hard to trigger: you only need to have written to the same record multiple times (with different updates) since the first node restarted.

This isn’t an issue for replication-factor > 2, as long as only a single node is restarted at a time.

Joel_Eidsath · October 1, 2016, 1:05am

We run at replication factor 3. How long can we space the restarts apart from each other (assuming we wait for nodes to come back up completely)? Should we expect full ACID compliance?

kporter · October 1, 2016, 1:17am

Once all nodes agree on the cluster_key and the migrate_allowed stat is true, it is safe to proceed to the next node.
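As a rough illustration of that gating rule, here is a minimal Python sketch. It assumes the per-node statistics have already been collected into plain dictionaries; how they are fetched is not shown. The function name `safe_to_upgrade_next`, the node names, and the string value "true" for migrate_allowed are assumptions made for the example. Only the stat names cluster_key and migrate_allowed come from the post above.

```python
from typing import Mapping


def safe_to_upgrade_next(node_stats: Mapping[str, Mapping[str, str]]) -> bool:
    """Return True when every node reports the same cluster_key
    and every node reports migrate_allowed as "true".

    node_stats maps a node name to that node's statistics
    (hypothetical shape, assumed for this sketch).
    """
    if not node_stats:
        # No information at all: do not proceed.
        return False

    cluster_keys = {stats.get("cluster_key") for stats in node_stats.values()}
    # All nodes must report a cluster_key, and it must be the same one.
    if None in cluster_keys or len(cluster_keys) != 1:
        return False

    # Assumption: the stat is reported as the string "true".
    return all(
        stats.get("migrate_allowed") == "true" for stats in node_stats.values()
    )


if __name__ == "__main__":
    stats = {
        "node-a": {"cluster_key": "A1B2C3", "migrate_allowed": "true"},
        "node-b": {"cluster_key": "A1B2C3", "migrate_allowed": "true"},
        "node-c": {"cluster_key": "A1B2C3", "migrate_allowed": "false"},
    }
    print(safe_to_upgrade_next(stats))
```

A rolling-upgrade script could poll a check like this after each restart and move on to the next node only once it returns True.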

While migrating, reads of records that have changed during migrations can be stale. Writes based on stale read generations will fail the generation check. With a performance cost, you can prevent the stale read issue by using read-consistency-level-override all.

Joel_Eidsath · October 1, 2016, 1:30am

Thanks! Can you be a bit more clear about this part?

Writes based on stale read generations will fail the generation check.

What will happen to a write that fails the generation check?

kporter · October 1, 2016, 1:34am

The write will be rejected by the server. The generation data allows the client to do a check-and-set type operation; see http://www.aerospike.com/docs/client/java/usage/kvs/write.html#read-modify-write.
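To make the check-and-set idea concrete, here is a toy in-memory model in Python. It is not the Aerospike client API: `ToyStore`, `GenerationMismatch`, and `read_modify_write` are hypothetical names invented for this sketch. It only shows the general pattern: each record carries a generation counter, a write states which generation it was based on, and a write based on a stale generation is rejected so the caller can re-read and retry.

```python
class GenerationMismatch(Exception):
    """Raised when a write is based on a generation that is no longer current."""


class ToyStore:
    """In-memory stand-in for a server that tracks a generation per record."""

    def __init__(self):
        self._records = {}  # key -> (generation, value)

    def read(self, key):
        # A record that was never written has generation 0 and no value.
        return self._records.get(key, (0, None))

    def write(self, key, value, expected_generation):
        current_generation, _ = self._records.get(key, (0, None))
        if expected_generation != current_generation:
            # The caller read an older version of the record: reject.
            raise GenerationMismatch(
                f"expected {expected_generation}, current {current_generation}"
            )
        new_generation = current_generation + 1
        self._records[key] = (new_generation, value)
        return new_generation


def read_modify_write(store, key, update, max_attempts=3):
    """Re-read and retry when the generation check fails."""
    for _ in range(max_attempts):
        generation, value = store.read(key)
        new_value = update(value)
        try:
            store.write(key, new_value, expected_generation=generation)
            return new_value
        except GenerationMismatch:
            continue  # someone else wrote in between; read again
    raise GenerationMismatch(f"gave up after {max_attempts} attempts")
```

In this model, a client holding a stale generation gets an error rather than silently overwriting a newer value, which is the behaviour described above for writes based on stale reads.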
