How to enable rack aware on an existing cluster: best practices

naoum · September 10, 2015, 3:36pm

Hi,

I have a cluster with 4 nodes that I want to enable rack awareness on. Currently they are all on the same rack, so as part of the process I need to replace 2 of the nodes with nodes on a different rack. I also need to change the config and restart each node as per http://www.aerospike.com/docs/operations/configure/network/rack-aware/ . What is the recommended sequence for doing all that? I remember a weird situation on my test cluster where it complained that the cluster was not balanced when I enabled rack aware there, so I want to make sure I don’t get into a similar situation. I also want to minimize the number of migrations, as each one can take a while.

Thanks!

kporter · September 11, 2015, 10:15pm

Are you bringing down 2 nodes and adding 2 fresh nodes on a different rack?

naoum · September 11, 2015, 10:16pm

@kporter yes, that’s the plan

kporter · September 11, 2015, 10:40pm

Assuming you cannot have downtime:

Procedure removed due to problems described in my next post.

If you are able to have downtime:

1. Configure your existing servers into two racks (one for the nodes that are staying, one for the nodes that are leaving).
2. Configure the new servers into a third rack.
3. Stop all nodes.
4. Start all nodes.
5. Wait for all migrations to complete.
6. Drop the two departing nodes (since they form their own rack, it is safe to drop them both at the same time).
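To sanity-check the rack layout before touching any config, a rough Python sketch like the one below may help. It is not an Aerospike tool: the node names, rack IDs, and helper functions are all made up for illustration, and it assumes that "balanced" simply means every rack holds the same number of nodes.

```python
from collections import Counter

def rack_sizes(assignment):
    """Return {rack_id: node_count} for a node -> rack mapping."""
    return dict(Counter(assignment.values()))

def is_balanced(assignment):
    """Assumption: 'balanced' means every rack holds the same number of nodes."""
    return len(set(rack_sizes(assignment).values())) <= 1

def drop_rack(assignment, rack_id):
    """Return the layout that remains after removing every node in rack_id."""
    return {node: rack for node, rack in assignment.items() if rack != rack_id}

# Hypothetical layout for the downtime procedure: three racks of two nodes.
layout = {
    "old-1": 1, "old-2": 1,  # existing nodes that stay
    "old-3": 2, "old-4": 2,  # existing nodes that leave
    "new-1": 3, "new-2": 3,  # new nodes on the other physical rack
}

# Layout once the departing rack has been dropped.
after_drop = drop_rack(layout, 2)

print(rack_sizes(layout), is_balanced(layout))
print(rack_sizes(after_drop), is_balanced(after_drop))
```

With this layout the racks have equal sizes both before and after the departing rack is dropped, which is the property the procedure relies on.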

naoum · September 15, 2015, 5:36pm

@kporter,

Thanks a lot for your response! I am definitely in the “Assuming you cannot have downtime” case so a couple of clarifying questions:

In Step 4), when I restart the nodes, they will pick up the rack groups, and each node will also re-join the cluster and have all its data migrated back to it:
- Do I have to restart one node at a time and wait for the migrations to complete? In my experience (without fast restart) that can take a while.
- While restarting nodes, some of them will know about the rack groups while others won’t. Do I have to do anything to keep my cluster from getting into an unhealthy state because the number of nodes in each group is not yet equal?
I just tried setting the heartbeat protocol to none and got the following error - is that only needed in mesh setup?

Sep 15 2015 23:52:25 GMT: INFO (info): (thr_info.c::3271) Changing value of heartbeat protocol version to none 
Sep 15 2015 23:52:25 GMT: WARNING (hb): (hb.c::1384) setting heartbeat protocol is only supported in heartbeat mode "multicast"

Thanks!

kporter · September 16, 2015, 1:30am

Currently only the multicast heartbeat mode supports changing the heartbeat protocol, so there will not be a way to do this without downtime.

Actually after discussing this I also learned that the procedure would have had problems; unrestarted nodes in the cluster while doing the restarts would be able to find much of the data and the replication would be directed to the wrong nodes by the unrestarted nodes.
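To picture why mixed configuration is a problem, here is a toy Python model. It is emphatically not Aerospike's real partition-assignment algorithm: the hash-based ordering, the `pick_replicas` helper, and the node and rack names are all assumptions made for illustration. It only shows that a node without rack configuration and a node with it can pick different replica nodes for the same partition.

```python
import hashlib

N_PARTITIONS = 4096  # Aerospike's fixed partition count

def node_order(partition, nodes):
    """Deterministic pseudo-random node ordering for one partition (toy model)."""
    def key(node):
        return hashlib.md5(f"{partition}:{node}".encode()).hexdigest()
    return sorted(nodes, key=key)

def pick_replicas(partition, nodes, racks=None):
    """Return (master, replica). With `racks`, the replica must sit on a
    different rack than the master whenever such a node exists."""
    order = node_order(partition, nodes)
    master = order[0]
    if racks is None:
        return master, order[1]
    for candidate in order[1:]:
        if racks[candidate] != racks[master]:
            return master, candidate
    return master, order[1]  # every node is on the master's rack

nodes = ["A", "B", "C", "D"]              # hypothetical node names
racks = {"A": 1, "B": 1, "C": 2, "D": 2}  # hypothetical rack assignment

disagreements = 0
masters_agree = True
for p in range(N_PARTITIONS):
    unrestarted_view = pick_replicas(p, nodes)       # no rack config yet
    restarted_view = pick_replicas(p, nodes, racks)  # rack config loaded
    masters_agree = masters_agree and unrestarted_view[0] == restarted_view[0]
    if unrestarted_view[1] != restarted_view[1]:
        disagreements += 1

print(f"{disagreements} of {N_PARTITIONS} partitions get a different replica")
```

In this model both views agree on the master, but for a sizeable share of partitions they disagree on where the replica lives, which is the kind of mismatch a rolling restart would create.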

naoum · September 16, 2015, 5:54pm

@kporter,

So in the downtime approach, is there a step 0) to bring down all the servers first? If I am restarting servers in step 3), is it safe for some of them to think they are in a rack while others don’t?

kporter · September 16, 2015, 7:46pm

Updated the procedure to clarify.
