General questions on Rolling Restart

Abstract

From time to time it may be necessary to perform a rolling restart of the cluster, for example to upgrade the kernel or the Aerospike server version, or to make non-dynamic configuration changes. A rolling restart consists of consecutive restarts of individual nodes such that only one node is down at any time, with the aim that, at the end of the process, all nodes have been restarted.

Each time a node leaves or joins the cluster, unless migrate-fill-delay has been configured, the data will immediately start rebalancing (refer to migrations for details).
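
One quick way to watch this rebalancing on servers running the new cluster protocol is to monitor the migrate_tx_partitions_remaining and migrate_rx_partitions_remaining namespace statistics; migrations for a namespace are complete once both reach 0 on every node. For example:

asadm> show statistics namespace like partitions_remaining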

As part of the new cluster protocol introduced in version 3.13 (refer to the 3.13 upgrade guide for details on the upgrade procedure), partial partitions are kept in the cluster until migrations for the relevant partition are complete (the partition has as many full copies available in the cluster as the replication factor dictates). It is therefore not necessary to wait for migrations to complete between each node restart, except if there are namespaces with either:

  • a replication-factor of 1, as each record then exists as a single copy, or
  • data that is not persisted (in-memory without storage persistence), as a restarting node loses its data and full copies must exist elsewhere in the cluster before the next node goes down.

If not in one of the above situations, simply wait for a node to have rejoined the cluster prior to taking the next node down.

One way to check whether a node has rejoined the cluster is to verify the cluster size across all nodes. For example:

asadm> show statistics like cluster_size -flip
~Service Statistics (2018-01-01 10:00:00 UTC)~
            NODE   cluster_size
10.68.00.01:3000              7
10.68.00.02:3000              7
10.68.00.03:3000              7
10.68.00.04:3000              7
10.68.00.05:3000              7
10.68.00.06:3000              7
10.68.00.07:3000              7
Number of rows: 7

Note: As of version 4.3, the cluster-stable command can be used directly to determine whether a node has rejoined the cluster.
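
For example, assuming an expected cluster size of 7 (matching the output above), the following info call returns the cluster key once all nodes are present; since partial partitions are allowed under the new cluster protocol, ongoing migrations can be ignored for this check:

asinfo -v 'cluster-stable:size=7;ignore-migrations=yes'

If the cluster size does not match, or the cluster is still forming, the command returns an error instead of the cluster key, which makes it convenient to poll from a script between node restarts.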

Application and server side considerations when performing a rolling restart

  • Potential application timeouts: Every time a node is stopped, some timeouts may occur on the client library side, either because of the short delay before the partition map is updated on the clients, or because of transactions that were in flight and did not reach the node before it was taken down. Depending on the client side policy (refer to the Understanding timeout and retry policies article for details) such issues may or may not be noticed on the application side. For server versions 4.3.1.3 and above, the quiesce info command can be used to handle this situation (see the sketch after this list). Refer to the documentation on Quiescing a Node for further details.

  • Added latency due to duplicate resolution: Adding nodes with data back into a cluster can cause some transactions to require duplicate resolution, as multiple versions of the same record may be present.

    • Note: strong-consistency enabled namespaces always duplicate resolve. The configuration to disable duplicate resolution is not available for such namespaces. Refer to Strong Consistency documentation for more details.

    • Write transactions - For AP namespaces, Aerospike will by default duplicate resolve across the different versions of a record for write transactions to ensure the correct version is used (based on the configured conflict-resolution-policy). This can lead to write latency spikes while migrations are ongoing. Waiting for migrations (delta or lead migrations only if migrate-fill-delay is in use) to complete before taking the subsequent node down removes the need for any duplicate resolution. Finally, for some use cases, and if the consequences are understood, it may be acceptable to disable duplicate resolution using the disable-write-dup-res configuration parameter. Note that disabling write duplicate resolution might lose some updates done during migrations.

    • Read transactions - For read transactions in an AP namespace, the client’s policy drives whether to duplicate resolve or not, and the server can enforce the behavior through the read-consistency-level-override configuration parameter. Since read duplicate resolution is off by default, there can be stale reads until migrations have completed. As with the previous point, waiting for migrations (delta or lead migrations only if migrate-fill-delay is in use) to complete before taking the subsequent node down removes the need for any duplicate resolution.

  • Potential increased load caused by migrations: Migrations cause data to be exchanged between nodes, which could, depending on the overall workload, impact network and/or disk I/O performance. By default, migration settings are conservative in order to limit this impact. Refer to the Tuning Migrations documentation for further details. For server versions 4.3.1 and above, migrations can be delayed by using the migrate-fill-delay configuration parameter (see the example after this list). Refer to the Delay Migrations documentation for more details.

  • Memory considerations: Proceeding with a rolling restart without waiting for migrations to complete between each node can temporarily increase memory consumption across the nodes in the cluster. As a node is taken down, the remaining nodes start receiving records for partitions previously owned by that node, which are held until migrations for those partitions complete. If another node is taken down before those migrations complete, further partitions start populating, and so on. In some cases this can create a non-negligible temporary increase in the total number of records owned by each node (hence memory and disk usage), which eventually subsides when migrations complete at the end of the rolling restart procedure.
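
As mentioned above, for server versions 4.3.1.3 and above, a node can be quiesced before being stopped so that clients shift transactions away from it first. Below is a minimal sketch of that flow; the info commands themselves are documented, while the surrounding steps (how the node is stopped, how throughput is monitored) depend on your environment:

# On the node about to be restarted, mark it as quiesced:
asinfo -v 'quiesce:'

# From any node, trigger a rebalance so clients move traffic away:
asinfo -v 'recluster:'

# Once transactions on the quiesced node have drained, stop it safely:
systemctl stop aerospike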
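
Similarly, migrate-fill-delay (server 4.3.1 and above) can be set dynamically; the 3600-second value here is only an example and should be sized to cover the expected downtime of each node:

asinfo -v 'set-config:context=service;migrate-fill-delay=3600'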

Versions running the older cluster protocol (prior to 3.13)

Depending on the nature of the write workload, consider one of the following two options to mitigate the risk of losing updates to records during migrations:

  1. Stop all write traffic to the cluster until the rolling restart is complete (the last node has rejoined the cluster). This guarantees updates will not be lost during migrations, as long as write duplicate resolution has not been disabled. If write load remains while nodes are being restarted, some records may not be available until a node has rejoined the cluster, and some updates may be lost during conflict resolution.

  2. During the rolling restart, wait for migrations to complete before taking the next node out of the cluster. Setting disable-write-dup-res to true disables write duplicate resolution, which can improve write performance during migrations but may also result in lost updates (see the example below).
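
For reference, this parameter can be changed dynamically. A minimal example, where the namespace name test is a placeholder for your own namespace:

asinfo -v 'set-config:context=namespace;id=test;disable-write-dup-res=true'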

Keywords

CLUSTER ROLLING RESTART

Timestamp

September 2019