Special considerations for upgrading to version 4.3 (RF>2 + Rack-Aware + AP)



Summary

Review this article if you are upgrading to 4.3 and have all of the following configurations in your cluster:

  • replication-factor greater than 2 (RF>2)
  • rack-aware enabled
  • namespaces running in AP mode

Details

When upgrading to 4.3 with the above-mentioned configuration, migrations may get stuck (refer to the Monitoring Migrations article to confirm that migrations are indeed stuck).
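One way to confirm is to watch the migrate_partitions_remaining statistic across the cluster: if it stays flat over several checks while migrations are expected to be running, they are likely stuck. A minimal sketch using asadm (statistic and filter names assume a reasonably recent server and tools version):

    # Show per-namespace migration progress across all nodes
    asadm -e "show statistics namespace like migrate_partitions_remaining"

    # Re-run periodically; a value stuck above 0 over a long period
    # indicates migrations are not progressing
    watch -n 60 'asadm -e "show statistics namespace like migrate_partitions_remaining"'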

This is due to a change in the partition balance algorithm introduced for the prefer-uniform-balance feature in 4.3. In more technical terms, nodes running the new version shift the working master to the right-most identically versioned node, while nodes running previous versions do not. When 2 copies of a partition end up with the same version on nodes running different server versions, migrations get confused and stuck for those partitions. Once all nodes are upgraded, this is of course no longer an issue.

Workaround

Upgrading from 4.2 to 4.3 with persistent storage

Proceed with the rolling upgrade as usual. If a node is not restarted in a timely fashion, migrations may get stuck; simply bring the node back into the cluster and continue the rolling upgrade.
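For reference, the per-node steps of a rolling upgrade typically look like the following sketch (the service name and package file name are assumptions; adjust for your platform and install method):

    # On each node, one at a time:
    systemctl stop aerospike                   # take the node out of the cluster
    # install the 4.3 server package, e.g. (hypothetical file name):
    rpm -Uvh aerospike-server-enterprise-4.3.x.rpm
    systemctl start aerospike                  # rejoin on the new version
    # confirm the node is back in the cluster before moving on
    asadm -e "info network"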

Upgrading from a version prior to 4.2 or upgrading from 4.2 to 4.3 without persistent storage

(A) If data is persisted, upgrade to 4.2 prior to upgrading to 4.3. The first upgrade requires erasing the data on the persistent storage, one node at a time, as detailed in the 4.2 upgrade documentation. Then proceed with a second upgrade to 4.3 following the instructions above.
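As an illustration of the node-at-a-time erase during the first hop to 4.2 (the device path below is a placeholder; the 4.2 upgrade documentation is the authoritative reference):

    systemctl stop aerospike
    # erase this node's persistent storage for the namespace
    # (/dev/sdb is a placeholder for your actual storage device)
    blkdiscard /dev/sdb        # or zero the device with dd
    # install the 4.2 package, then restart the node
    systemctl start aerospike
    # wait for migrations to complete before erasing the next node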

(B) For data-in-memory only namespaces, or when upgrading directly from a version prior to 4.2 to 4.3 (requiring erasing all data), upgrade one rack at a time, waiting for migrations to finish their activity (either completing or getting stuck) before proceeding to the next rack. To avoid running out of capacity on the remaining racks, it may be helpful to stop migrations while a rack is down: set migrate-threads to 0 prior to taking the rack down, and set it back to its original value (the default is 1) once the rack is upgraded and back up, as shown in the sketch below. This may cause some partitions to temporarily have one less copy in the cluster.
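A sketch of pausing and resuming migrations cluster-wide, using asinfo through asadm so the setting is applied to all nodes at once:

    # Before taking a rack down: stop migrations on all nodes
    asadm -e 'asinfo -v "set-config:context=service;migrate-threads=0"'

    # Once the rack is upgraded and back up: restore the original value
    # (the default is 1; use your previous setting if it differed)
    asadm -e 'asinfo -v "set-config:context=service;migrate-threads=1"'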

(C) An alternative to (B), if running on 3 racks, is to take 2 racks down, upgrade them both, bring them back, wait for migrations to complete, and then upgrade the last rack. As in (B), it may be helpful to stop migrations while the 2 racks are down to avoid running out of capacity on the remaining rack, by setting migrate-threads to 0 prior to taking the racks down and setting it back to its original value (the default is 1) once the racks are upgraded and back up. In this case, when 2 racks are down, migrations should complete: with only one remaining rack, there are no conflicts between 2 copies of the same partition that would confuse migrations and get them stuck. This of course means temporarily running with only 1 copy of each partition.
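Before upgrading the last rack in option (C), confirm that migrations have completed, for example via the pending migrates shown in the namespace info (column naming varies by tools version):

    # Pending migrates should read (0,0) on every node
    asadm -e "info namespace"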

Note

  • This is a one-off issue. Once the upgrade to 4.3 is completed, it will not recur, as all nodes will be running the same distribution algorithm.
  • In all of the options listed above, no data is lost unless something unexpected happens during the procedure, such as additional nodes going down.

Keywords

UPGRADE 4.3 RACK-AWARE MIGRATIONS STUCK

Timestamp

Aug 31 2018