Non-ACID compliant upgrades

Do not upgrade another node in your cluster until the Migrates column shows zeroes. Interrupting data migration may cause data loss.

If you are upgrading a large cluster, you may begin upgrading the next server after a delay of 30-60 seconds, without allowing the data migration to complete. This method may not be fully ACID compliant.

Does anyone have any experience with this? What parts of ACID are lost exactly? I assume that it’s only update propagation that is affected, with no data loss (for a large enough cluster).

Our migrations take 2-3 days after a server restart, so we are considering this method.

When running replication-factor 2, new writes (updates or creates) can be lost when two or more nodes have each been restarted since the last migration completed. This is not hard to trigger: you only need to have written to the same record more than once (with different updates) since the first node restarted.

This isn’t an issue for replication-factor > 2, as long as only a single node is restarted at a time.

We run at replication factor 3. How closely can we space the restarts (assuming we wait for each node to come back up completely)? Should we expect full ACID compliance?

Once all nodes agree on the cluster_key and the migrate_allowed stat is true, it is safe to proceed to the next node.
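A rough sketch of that check, in case it helps anyone scripting a rolling restart. It assumes you have already collected each node's statistics as a `key=value;key=value` string (how you fetch them is up to you and is not shown). The function names `parse_stats` and `safe_to_restart_next` are made up for illustration; only `cluster_key` and `migrate_allowed` come from the advice above.

```python
from typing import Dict


def parse_stats(raw: str) -> Dict[str, str]:
    """Parse a 'k1=v1;k2=v2' statistics string into a dict (assumed format)."""
    pairs = (item.split("=", 1) for item in raw.strip().split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def safe_to_restart_next(node_stats: Dict[str, Dict[str, str]]) -> bool:
    """Return True only if every node reports the same cluster_key
    and every node reports migrate_allowed as true."""
    if not node_stats:
        return False  # no data collected: do not proceed

    cluster_keys = {stats.get("cluster_key") for stats in node_stats.values()}
    if None in cluster_keys or len(cluster_keys) != 1:
        return False  # a node is missing the stat, or nodes disagree

    return all(
        stats.get("migrate_allowed", "").lower() == "true"
        for stats in node_stats.values()
    )
```

You would call `safe_to_restart_next({"node1": parse_stats(raw1), "node2": parse_stats(raw2)})` in a polling loop and only move on to the next node once it returns `True`.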

While migrations are in progress, reads of records that changed during the migration can be stale. A write based on the generation from a stale read will fail the generation check. At a performance cost, you can prevent stale reads by using read-consistency-level-override all.

Thanks! Can you be a bit more clear about this part?

Writes based on stale read generations will fail the generation check.

What will happen to a write that fails the generation check?

The write will be rejected by the server. The generation data allows the client to do a check-and-set type operation, see
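To make the rejected-write behaviour concrete, here is a toy simulation of a generation check. This is not the Aerospike client API: `ToyStore`, `GenerationError`, and `try_cas_write` are hypothetical names, and the store is just an in-memory dict that bumps a counter on every successful write.

```python
class GenerationError(Exception):
    """Raised when the expected generation does not match the stored one."""


class ToyStore:
    """In-memory stand-in for a server that tracks a generation per record."""

    def __init__(self):
        self._records = {}  # key -> (generation, value)

    def read(self, key):
        # Returns (generation, value); generation 0 means "no such record".
        return self._records.get(key, (0, None))

    def write(self, key, value, expected_generation=None):
        current_generation, _ = self._records.get(key, (0, None))
        # Check-and-set: reject the write if the caller's view is out of date.
        if expected_generation is not None and expected_generation != current_generation:
            raise GenerationError(
                f"expected generation {expected_generation}, found {current_generation}"
            )
        new_generation = current_generation + 1
        self._records[key] = (new_generation, value)
        return new_generation


def try_cas_write(store, key, value, expected_generation):
    """Attempt a check-and-set write; return True on success, False if rejected."""
    try:
        store.write(key, value, expected_generation=expected_generation)
        return True
    except GenerationError:
        return False
```

If a client reads a record at generation 1, someone else then updates it to generation 2, and the first client calls `try_cas_write(store, key, value, 1)`, the call returns `False` and the stored value is left unchanged. The client is then expected to re-read and retry.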