FAQ - Which completes migration faster - restarting an empty node or restarting a node with data?



FAQ - Cold Restart - Which is faster - restarting a node with data or restarting a node after emptying it?

Details

Assuming a fast restart is not possible (http://www.aerospike.com/docs/operations/manage/aerospike/fast_start), which of these two procedures completes faster?

A. Avoid a cold restart by emptying the node (i.e. run dd or blkdiscard on the SSD device(s), or delete the file(s)) before bringing it back in, then wait for migrations to fully repopulate it.

or

B. Wait for a cold restart to load data from persistent storage and then wait for migrations to complete.

Answer

A. Restart after emptying the node

In this case, you will have to wait for migrations to repopulate the node that was restarted empty. There will also be some migrations between the other nodes, but fewer, because rapid rebalance migrates only the records that changed while the node was down.

Rapid rebalance compares record fingerprints (mainly generation and last-modified time) 4000 records at a time.
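The batch-wise fingerprint comparison described above can be pictured with a small sketch. This is only an illustration of the idea, not Aerospike's actual implementation: the `Fingerprint` tuple, the dictionary-based stores, and the function names are all hypothetical, and only the batch size of 4000 comes from the text.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical fingerprint: (generation, last_modified_time).
# This is an illustrative stand-in, not Aerospike's internal record layout.
Fingerprint = Tuple[int, int]

BATCH_SIZE = 4000  # batch size quoted in the text


def batches(keys: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` keys."""
    it = iter(keys)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def records_to_migrate(
    source: Dict[str, Fingerprint], target: Dict[str, Fingerprint]
) -> List[str]:
    """Return the keys whose fingerprint on `target` is missing or differs from `source`."""
    changed: List[str] = []
    for chunk in batches(sorted(source)):
        # Only the fingerprints of one batch are compared at a time.
        for key in chunk:
            if target.get(key) != source[key]:
                changed.append(key)
    return changed


if __name__ == "__main__":
    source = {f"k{i}": (1, 100) for i in range(10_000)}
    target = dict(source)
    target["k5"] = (0, 50)  # stale copy on the returning node
    del target["k7"]        # record written while the node was down
    print(records_to_migrate(source, target))
```

Records whose fingerprints match are skipped, which is why a node that comes back with its data intact receives far fewer records than a node that comes back empty.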

Migrations can take hours, depending on the number of NICs and the network bandwidth available (1G vs 10G), but the migration rate can be tuned.
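As a rough way to see why link speed matters, the sketch below estimates the network-bound lower limit on the time needed to repopulate an empty node. The function name, the `usable_fraction` parameter, and the example figures are assumptions made for illustration; real migration time also depends on migration tuning, disk I/O, and client load.

```python
def transfer_hours(data_gb: float, link_gbps: float, usable_fraction: float = 0.5) -> float:
    """Estimate the hours needed to move `data_gb` gigabytes over a `link_gbps` link.

    `usable_fraction` is a hypothetical share of the link left for migration
    traffic after client traffic; it is an assumption, not a measured value.
    """
    if data_gb < 0 or link_gbps <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("invalid input")
    data_gbits = data_gb * 8  # gigabytes to gigabits
    seconds = data_gbits / (link_gbps * usable_fraction)
    return seconds / 3600


if __name__ == "__main__":
    # Illustrative figures only: 2 TB of data to repopulate on one node.
    for gbps in (1, 10):
        print(f"{gbps}G link: about {transfer_hours(2000, gbps):.1f} hours")
```

Under these assumptions the 10G link finishes ten times sooner than the 1G link, so the result should be read as a best-case bound rather than a prediction.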

For raw SSD devices, the dd command can also take some time, whereas blkdiscard or removing the persistent files takes only seconds.

B. Cold restart (data preserved on persistent storage)

In this case, the node will first have to load the data from its persistent storage by scanning the entire device(s), which can take a very long time depending on the size of the SSD device(s) and how much data was written on them. Partitioning the device(s) can help speed this part up.
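The effect of device size and partitioning on the load phase can be approximated with a back-of-the-envelope calculation. The sketch below assumes that partitions are scanned in parallel and that aggregate read throughput scales linearly with the number of partitions; both are simplifying assumptions, and the function name, parameters, and figures are hypothetical.

```python
def cold_scan_minutes(scanned_gb: float, read_mb_per_s: float, parallel_scans: int = 1) -> float:
    """Estimate the minutes needed to scan `scanned_gb` gigabytes at cold start.

    Assumes each of the `parallel_scans` partitions is read concurrently at
    `read_mb_per_s`, i.e. linear scaling, which real hardware may not deliver.
    """
    if scanned_gb < 0 or read_mb_per_s <= 0 or parallel_scans < 1:
        raise ValueError("invalid input")
    seconds = (scanned_gb * 1024) / (read_mb_per_s * parallel_scans)
    return seconds / 60


if __name__ == "__main__":
    # Illustrative figures only: a 1 TiB device read at 512 MB/s.
    print(f"1 partition:  {cold_scan_minutes(1024, 512):.1f} min")
    print(f"4 partitions: {cold_scan_minutes(1024, 512, parallel_scans=4):.1f} min")
```

The estimate covers only the device scan; index rebuilding and any evictions triggered during startup would add to it.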

More importantly, a cold restart risks bringing back old data, depending on how the data was deleted, which adds to the load time. The returning data could also breach one of the high-water marks (disk or memory) and trigger evictions, which considerably slows down the startup of the node.

Finally, migrations will also kick in once the node has started and joined the cluster. The time needed to complete migrations depends on several factors, such as SSD I/O performance, CPU capacity, the number of records, how long the node was out of the cluster, and how much data was written during that time. That said, if the cluster is running the new cluster protocol introduced in versions 3.13/3.14, it is not necessary to wait for migrations to complete before taking another node down.

Conclusion

The best approach depends on the specific use case. The trade-off is between tolerating a temporarily lower replication factor for some partitions until migrations complete (when emptying the node before bringing it back in) and going through a cold restart, which could bring back older records (depending on how the data was deleted) and could also take quite some time. During a cold restart some partitions also have a lower replication factor, but their copies become available again when the node rejoins the cluster.

Notes

  1. To speed up cold start: How do I speed up cold start?

  2. To speed up cold start eviction: FAQ - What options are available to speed up cold start eviction

  3. When data-in-memory is set to true, a cold restart cannot be avoided and data must be loaded from persistent storage: http://www.aerospike.com/docs/operations/manage/aerospike/fast_start#when-does-fast-restart-not-happen- and http://www.aerospike.com/docs/operations/manage/aerospike/cold_start#when-does-aerospike-cold-restart-

  4. To speed up migration: http://www.aerospike.com/docs/operations/manage/migration#speeding-up-the-migration-rate

Keywords

COLD START SPEED FAST RAPID REBALANCE MIGRATION

Timestamp

08/09/2017