How to efficiently perform a full cluster reboot for OS patching?


#1

Hello! First time here, please move this to another category if inappropriate (this is not an actual Aerospike upgrade question).

We’re using community edition and I have been asked to test various scenarios for when OS patches are applied to all our servers and reboots are required.

As I understand it we can:

  1. leave the site in service, performing rolling reboots waiting for all migrations to complete before moving to the next server.

  2. take the site out of service, stop asd, delete the data, reboot all the servers, and reload the data.

  3. take the site out of service, stop asd, reboot all the servers, and ? magic happens ? :wink:

I have not been through any training yet; searching turned up a few things that might be relevant, but nothing conclusive.

I think there should be something straightforward in this that I am missing.

How do others accomplish this task?

Thanks in advance! Jay


#2

If you are using a version after 3.13, or you have changed paxos-protocol to v5 in 3.13, then you no longer need to wait for migrations to complete between server restarts. But you should wait for the restarted node to join the cluster, by observing that the cluster size returns to the number of nodes in the cluster, before restarting the next node.
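For example, one way to script that "wait for the node to rejoin" check (command and statistic names are from the standard Aerospike tools; verify against your version's docs):

```shell
# Poll the local node's statistics until the reported cluster_size is back
# to the expected node count (4 in the original poster's cluster).
# `asinfo` ships with the Aerospike tools package; -l splits output on ';'.
EXPECTED=4
while true; do
  size=$(asinfo -v statistics -l | awk -F= '/^cluster_size=/ {print $2}')
  [ "$size" = "$EXPECTED" ] && break
  sleep 5
done
echo "node rejoined; cluster_size=$size"
```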

Sure, option 2 would work, but you would lose your data ;).

Option 3 is similar to option 2, but you get to keep your data.

For the two solutions where you keep your data, if you have used client initiated non-durable deletes then you could have some bonus data since non-durably deleted records could return on cold-start.


#3

So a little more info as I learn where to look: build C-3.11.0.2, paxos-single-replica-limit 1, paxos-protocol v4, nodes 4, replication-factor 2.

  1. leave the site in service, performing rolling reboots waiting for all migrations to complete before moving to the next server.

So I think of this as the standard reference procedure. The cluster stays in service; queries, loads, and scans operate normally. What I didn’t like was the time required for even ONE server rebooting (over 69 hours?):

Mar 09 2018 18:10:45 GMT: INFO (partition): (partition_balance.c:1162) {xxxyyyzzz} re-balanced, expected migrations - (697 tx, 707 rx)
Mar 09 2018 20:53:40 GMT: INFO (info): (ticker.c:406) {xxxyyyzzz} migrations: remaining (611,621) active (1,1) complete-pct 12.25
Mar 09 2018 20:53:50 GMT: INFO (partition): (partition_balance.c:1162) {xxxyyyzzz} re-balanced, expected migrations - (697 tx, 697 rx)
Mar 09 2018 20:53:50 GMT: INFO (info): (ticker.c:406) {xxxyyyzzz} migrations: remaining (697,697) active (1,0) complete-pct 0.00
(March 10 and 11: the complete-pct just keeps increasing)
Mar 12 2018 13:23:49 GMT: INFO (info): (ticker.c:406) {xxxyyyzzz} migrations: remaining (2,3) active (1,0) complete-pct 99.64
Mar 12 2018 13:35:09 GMT: INFO (info): (ticker.c:406) {xxxyyyzzz} migrations: remaining (1,0) active (1,0) complete-pct 99.93
Mar 12 2018 13:35:19 GMT: INFO (info): (ticker.c:409) {xxxyyyzzz} migrations: complete
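The wait between rolling reboots can also be automated against the same information the ticker logs, e.g. by polling migrate_partitions_remaining (a sketch; the statistic name is the one used later in this thread, so check it against your build):

```shell
# Block until migrations are finished before rebooting the next node.
# migrate_partitions_remaining reaches 0 when the "migrations: complete"
# ticker line would be logged.
while true; do
  left=$(asinfo -v statistics -l | awk -F= '/^migrate_partitions_remaining=/ {print $2}')
  [ "$left" = "0" ] && break
  echo "partitions still migrating: $left"
  sleep 30
done
```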
  2. take the site out of service, stop asd, delete the data, reboot all the servers, and reload the data.

The concept was to bypass all the re-balancing and index loads. The downside is that the cluster goes out of service - no queries or scans. With my test cluster, clearing the devices took just under an hour and the fresh load took 7.5 hours.
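A minimal sketch of the "delete the data" step for a raw-device namespace (the device path, service name, and amount zeroed are illustrative assumptions, not the official procedure; adapt them to your own configuration):

```shell
# Stop the server first, then zero the start of each data device so
# Aerospike cold-starts clean. /dev/sdb is a placeholder for whatever
# device your namespace's storage-engine stanza points at.
sudo systemctl stop aerospike        # or: sudo service aerospike stop
sudo dd if=/dev/zero of=/dev/sdb bs=1M count=8 conv=fsync
```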

  3. take the site out of service, stop asd, reboot all the servers, and ? magic happens ?

I’m getting ready to do this after I read up some more on non-durable deletes & such :wink:

Thanks!


#4

These are problems that have been addressed in Aerospike Enterprise. See “Fast Restart” and “Rapid Rebalance”. Fast Restart reduces server restart times from hours to minutes, and Rapid Rebalance typically takes about 1/40th the time to rebalance after a restart.


#5

Yes - it looks like the Fast Restart / Rapid Rebalance combination is what I want.

I’m trying to track down who did the design and install on our system.

Just rebooting the whole cluster is taking about 1.5hrs for the indices and then just over 8hrs to complete the re-balance.

I’m still not clear why stopping traffic, dynamically setting migrate-threads to 0 to freeze migrations, and then setting the same value statically so migrations stay frozen across the reboot doesn’t help. After waiting for all nodes to rejoin the cluster, migrate_partitions_remaining is still non-zero. Shouldn’t Aerospike be able to rescan at that point to minimize migrations before turning migrate-threads back up?
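For reference, the freeze being described could be done roughly like this (the set-config form follows the Aerospike info protocol; confirm the exact syntax for your build before relying on it):

```shell
# 1. Dynamically stop migrations on every node before rebooting:
asadm -e "asinfo -v 'set-config:context=service;migrate-threads=0'"

# 2. Keep migrations frozen across the restart by also setting it
#    statically in /etc/aerospike/aerospike.conf:
#        service {
#            migrate-threads 0
#        }

# 3. Once all nodes have rejoined, re-enable migrations dynamically:
asadm -e "asinfo -v 'set-config:context=service;migrate-threads=1'"
```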


#6

When a node restarts, it assumes the other nodes continued and marks its partitions as subsets. When rebalance occurs and all versions of a partition are subsets, a full two-way migration is required - we cannot assume matching subsets contain the same records.


#7

This topic was automatically closed 6 days after the last reply. New replies are no longer allowed.