Special considerations for upgrading to version 4.3
Review this article if you are upgrading to 4.3 and have the following configurations in your cluster:
- rack-aware enabled (replication-factor 2 or greater)
- Namespace running in AP mode (strong-consistency set to false)
When upgrading to 4.3 with the above-mentioned configuration, migrations may get stuck (refer to the Monitoring Migrations article to confirm that migrations are indeed stuck).
This is due to a change in the partition balance algorithm introduced for the prefer-uniform-balance feature in 4.3. In more technical terms, nodes running the new version shift the working master to the right-most identically versioned node. Since nodes running the previous version do not do this, when 2 copies of a partition end up with the same version on nodes running different server versions, migrations become confused and get stuck for that partition. Once all nodes are upgraded, this is of course no longer an issue.
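To confirm whether migrations are stuck, the per-namespace migration statistics can be polled. The sketch below assumes recent asadm/asinfo tooling and a namespace named test; if migrate_partitions_remaining stays flat at a non-zero value over several minutes while migrate threads are enabled, migrations are likely stuck:

```shell
# Show remaining partition migrations across the cluster
# (repeat after a few minutes and compare the values).
asadm -e "show statistics namespace like migrate_partitions_remaining"

# Or query a single node directly over the info protocol
# ("test" is a placeholder namespace name).
asinfo -v "namespace/test" -l | grep migrate_partitions_remaining
```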
Upgrading from 4.2 to 4.3 with persistent storage
Simply proceed with the rolling upgrade as usual. If a node is not restarted in a timely fashion, migrations may get stuck; bring the node back into the cluster and proceed with the rolling upgrade. A recluster command may also be issued when migrations get stuck. This resets the acting master flags and ensures that only one node is the acting master for each partition.
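As a sketch, the recluster command can be issued through the info protocol from any node (the principal coordinates the rebalance); the host and port below are placeholder values:

```shell
# Trigger a rebalance, which resets the acting-master flags so that
# exactly one node acts as master for each partition.
asinfo -h 127.0.0.1 -p 3000 -v "recluster:"
```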
Upgrading from a version prior to 4.2 or upgrading from 4.2 to 4.3+ without persistent storage
A) If data is persisted, upgrade to 4.2 prior to upgrading to 4.3. The first upgrade will require erasing the data on the persistent storage, one node at a time, as detailed in the 4.2 upgrade documentation. Then proceed with a second upgrade to 4.3, following the instructions above.
B) For data-in-memory-only namespaces, or when upgrading directly from a version prior to 4.2 to 4.3 (requiring erasing all data), upgrade one rack at a time, waiting for migrations to finish (either completing or getting stuck) before proceeding to the next rack. To avoid running out of capacity on the remaining racks, it may be helpful to stop migrations while a rack is down by setting migrate-threads to 0 prior to taking the rack down, then setting it back to its original value (default is 1) once the rack is upgraded and back up. This may cause some partitions to temporarily have one fewer copy in the cluster.
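The migrate-threads toggling described above can be applied dynamically, cluster-wide, through asadm; the restore value of 1 below assumes the default was never changed:

```shell
# Pause migrations cluster-wide before taking a rack down.
asadm -e 'asinfo -v "set-config:context=service;migrate-threads=0"'

# ... take the rack down, upgrade it, bring it back up ...

# Restore the original value (default is 1) once the rack rejoins.
asadm -e 'asinfo -v "set-config:context=service;migrate-threads=1"'
```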
C) An alternative to (B), if running on 3 racks with replication factor >= 3, is to take 2 racks down, upgrade them both, bring them back, wait for migrations to complete, and then upgrade the last rack. As in (B), it may be helpful to stop migrations while the 2 racks are down, to avoid running out of capacity on the remaining rack, by setting migrate-threads to 0 prior to taking the racks down and setting it back to its original value (default is 1) when they are upgraded and back up. In this case, with only one rack remaining, migrations should complete, since there will not be any conflict between 2 copies of the same partition to confuse migrations and get them stuck. This of course means temporarily running with only 1 copy of each partition.
D) If migrations get stuck on a cluster with a mixed environment of 4.3+ and versions prior to 4.3, the following options will get migrations going:
- Issue a recluster command on the principal node.
- For versions prior to 3.14, trigger a reclustering by temporarily increasing the heartbeat interval of the principal node to 5000 ms. Revert this change to stabilize the cluster once migrations have started.
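The heartbeat interval can be raised dynamically on the principal node; the exact set-config syntax has varied across 3.x releases, so the form below is a sketch to verify against the configuration reference for the version in use (150 ms is the server default):

```shell
# On the principal node only: raise the heartbeat interval to 5000 ms
# so the rest of the cluster expels it, triggering a reclustering.
asinfo -v "set-config:context=network;heartbeat.interval=5000"

# Once migrations have started, revert to the original value
# to stabilize the cluster.
asinfo -v "set-config:context=network;heartbeat.interval=150"
```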
- This is a one-off issue. Once the upgrade to 4.3 is complete, it will not recur, as all nodes will be running the same distribution algorithm.
- In all options listed above, no data is lost unless something unexpected happens during the procedure, such as additional nodes going down.
Aug 31 2018