Special considerations for upgrading to version 4.3 (Rack-Aware + AP)

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.


Summary

Review this article if you are upgrading to 4.3 and have the following configurations in your cluster:

  • rack-aware configured
  • namespaces running in AP (availability) mode, i.e. not strong-consistency

Details

When upgrading to 4.3 (with the above-mentioned configuration), migrations may get stuck (refer to the Monitoring Migrations article to determine whether migrations are stuck).
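
As a quick check (a minimal sketch; the Monitoring Migrations article describes the full procedure, and exact statistic names vary slightly by server version), sample the pending migration counters a few minutes apart; if they do not decrease, migrations are likely stuck:

  $ asadm -e "show statistics like migrate"
  $ asinfo -v "statistics" -l | grep migrate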

This is due to a change in the partition balance algorithm introduced for the prefer-uniform-balance feature in 4.3. Nodes running the new version shift the working (acting) master to the right-most node holding an identical partition version. Nodes running previous versions do not do this, so when 2 copies of a partition end up with the same version on nodes running different server versions, migrations can get confused and stuck for those partitions. Once all nodes are upgraded, the issue is resolved.

Another side effect of this partition balance algorithm change can be an impact on client traffic, including XDR. The same confusion that temporarily affects migrations can also affect inbound client traffic.

Workaround

For both situations described below, the principal node should be upgraded first. The principal node can be found by running the info command via asadm and identifying the node with an asterisk next to it. Alternatively, refer to cluster_principal.
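
For example, a minimal sketch for locating the principal (the asterisk appears next to the principal node's ID in asadm's network info output):

  $ asadm -e "info network"
  $ asinfo -v "statistics" -l | grep -i principal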

Upgrading from 4.2 to 4.3 with persistent storage

Proceed with the rolling upgrade as usual. If a node is not restarted in a timely fashion, migrations may get stuck. Bring the node back into the cluster and proceed with the rolling upgrade. A recluster command may be issued when migrations get stuck; this resets the acting master flags and ensures that only one node acts as master for each partition.
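
A recluster can be triggered with the info command below (a sketch; per option (D) below, it is typically issued on the principal node, and the target address shown is a placeholder):

  $ asinfo -h <principal-node-ip> -v "recluster:"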

Upgrading from a version prior to 4.2 or upgrading from 4.2 to 4.3+ without persistent storage

(A) If data is persisted, upgrade to 4.2 prior to upgrading to 4.3. The first upgrade requires erasing the data on the persistent storage, one node at a time, as detailed in the 4.2 upgrade documentation. Then proceed with a second upgrade to 4.3 following the instructions above.
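
As an illustration only (a hedged sketch; the device path is a placeholder and the authoritative steps are in the 4.2 upgrade documentation), erasing a raw storage device on one node before restarting it could look like:

  $ systemctl stop aerospike
  $ blkdiscard /dev/<device>        # or zero it out: dd if=/dev/zero of=/dev/<device> bs=1M
  $ systemctl start aerospike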

(B) For data-in-memory-only namespaces, or when upgrading directly from a version prior to 4.2 to 4.3 (which requires erasing all data), upgrade one rack at a time, waiting for migrations to finish their activity (either completing or getting stuck) before proceeding to the next rack. It may be helpful to stop migrations while a rack is down, by setting migrate-threads to 0, to avoid running out of capacity on the remaining racks. Remember to set migrate-threads back to 1 (or its previously configured value) once the rack is upgraded and back up. This may cause some partitions to temporarily have one fewer copy in the cluster.
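
For example, a sketch of pausing and resuming migrations dynamically (migrate-threads is a dynamic service-context parameter; run the command on every node, for instance through asadm):

  # pause migrations while the rack is down
  $ asinfo -v "set-config:context=service;migrate-threads=0"
  # resume once the rack is upgraded and back up (use the previously configured value if it was not 1)
  $ asinfo -v "set-config:context=service;migrate-threads=1"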

(C) As an alternative to (B), if running on 3 racks with replication factor >= 3, take 2 racks down, upgrade them both, bring them back, wait for migrations to complete, and then upgrade the last rack. As in (B), it may be helpful to stop migrations while the racks are down by setting migrate-threads to 0, and to set it back to 1 (or its previously configured value) once the racks are upgraded and back up. In this case, when 2 racks are down, migrations should complete: with only one rack remaining, there are no conflicts between 2 copies of the same partition that would confuse migrations and get them stuck. This means temporarily running with only 1 copy of each partition.

(D) If migrations get stuck on a cluster running a mix of 4.3+ and versions prior to 4.3, the following options exist to get migrations going.

  • Issue a recluster command on the principal node as each upgraded node joins the cluster.
  • For versions prior to 3.14, trigger a re-clustering by temporarily increasing the heartbeat interval of the principal node to 5000 ms (see the sketch below). This change needs to be reverted to stabilize the cluster after migrations have started.
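
For example, a sketch of the heartbeat adjustment for versions prior to 3.14 (assuming the interval is dynamically settable on that version; 150 ms is the usual default, so revert to whatever is actually configured on the node):

  # on the principal node, temporarily raise the heartbeat interval to force a re-clustering
  $ asinfo -v "set-config:context=network;heartbeat.interval=5000"
  # revert after migrations have started, to stabilize the cluster
  $ asinfo -v "set-config:context=network;heartbeat.interval=150"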

Note

  • This is a one-off issue. Once the upgrade to 4.3 is completed, the issue will not recur, as all nodes will be running the same distribution algorithm.
  • In all options listed above, no data is lost, unless something unexpected happens during the procedure, such as additional nodes going down.

Keywords

UPGRADE 4.3 RACK-AWARE MIGRATIONS STUCK

Timestamp

Aug 31 2018