How to enable rack awareness on an existing cluster: best practices

Hi,

I have a 4-node cluster on which I want to enable rack awareness. Currently all nodes are on the same rack, so as part of the process I need to replace 2 of them with nodes on a different rack. I also need to change the config and restart each node as per http://www.aerospike.com/docs/operations/configure/network/rack-aware/ . What is the recommended sequence for all of that? I remember a weird situation on my test cluster where it complained that the cluster was not balanced when I enabled rack awareness there, so I want to make sure I don’t end up in a similar situation. I also want to minimize the number of migrations, as each one can take a while.

Thanks!

Are you bringing down 2 nodes and adding 2 fresh nodes on a different rack?

@kporter yes, that’s the plan

Assuming you cannot have downtime

Procedure removed due to problems described in my next post.

If you are able to have downtime,

  1. Configure your existing servers into two racks (the nodes that are staying and the nodes that are leaving).
  2. Configure the new servers into a third rack.
  3. Stop all nodes.
  4. Start all nodes.
  5. Wait for all migrations to complete.
  6. Drop the nodes that are leaving (since they form a rack of their own, it is safe to drop them both at the same time).
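
To illustrate why step 6 is safe, here is a minimal toy model in Python. It is not Aerospike's actual partition-placement algorithm: the node names, rack numbers, hash-based ordering, and the `place` and `safe_to_drop` helpers are all assumptions made for illustration, with a replication factor of 2. The only point it shows is that when the two copies of every partition are forced onto different racks, removing one whole rack can never remove both copies.

```python
import hashlib

PARTITIONS = 4096  # Aerospike splits each namespace into 4096 partitions


def node_order(pid, nodes):
    """Deterministic per-partition node ordering (hypothetical stand-in)."""
    return sorted(nodes, key=lambda n: hashlib.sha256(f"{pid}:{n}".encode()).hexdigest())


def place(pid, rack_of):
    """Return (master, replica); the replica is the first node on another rack."""
    order = node_order(pid, list(rack_of))
    master = order[0]
    for node in order[1:]:
        if rack_of[node] != rack_of[master]:
            return master, node
    return master, order[1]  # only one rack exists: fall back to the next node


def safe_to_drop(rack_of, dropped):
    """True if every partition keeps at least one copy once `dropped` is removed."""
    dropped = set(dropped)
    return all(not set(place(pid, rack_of)) <= dropped for pid in range(PARTITIONS))


# Hypothetical layout: A, B stay (rack 1), C, D leave (rack 2), E, F are new (rack 3).
three_racks = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3, "F": 3}
one_rack = {name: 1 for name in three_racks}

drop_whole_rack_ok = safe_to_drop(three_racks, {"C", "D"})
drop_without_racks_ok = safe_to_drop(one_rack, {"C", "D"})
print(drop_whole_rack_ok, drop_without_racks_ok)
```

In this model, dropping C and D together is safe only when they share a rack that holds at most one copy of each partition. Without rack information, some partitions end up with both copies on C and D.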

@kporter,

Thanks a lot for your response! I am definitely in the “Assuming you cannot have downtime” case so a couple of clarifying questions:

  • In step 4), when I restart the nodes, each one will pick up the rack groups, re-join the cluster, and have all its data migrated back to it.

    • Do I have to restart one node at a time and wait for the migrations to complete? In my experience (without fast restart) that can take a while.
    • While I am restarting nodes, some of them will know about the rack groups while others won’t. Do I have to do anything to keep the cluster from getting into an unhealthy state because the number of nodes in each group is not yet equal?
  • I just tried setting the heartbeat protocol to none and got the following error. Is that step only needed in a mesh setup?

Sep 15 2015 23:52:25 GMT: INFO (info): (thr_info.c::3271) Changing value of heartbeat protocol version to none 
Sep 15 2015 23:52:25 GMT: WARNING (hb): (hb.c::1384) setting heartbeat protocol is only supported in heartbeat mode "multicast"

Thanks!

Currently, changing the heartbeat protocol is only supported in multicast heartbeat mode, so there will not be a way to do this without downtime.

:blush: Actually, after discussing this, I also learned that the procedure would have had problems: while the restarts were in progress, the nodes not yet restarted would be able to find much of the data, and they would direct replication to the wrong nodes.
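
A rough way to picture the misdirected replication is the toy Python sketch below. It is an assumption-laden illustration, not Aerospike's real placement logic: the four node names, the rack numbers, and the `replica_for` helper are hypothetical. It compares where a restarted node (which knows the racks) and an unrestarted node (which sees one flat group) would each send the replica of every partition.

```python
import hashlib

PARTITIONS = 4096  # Aerospike splits each namespace into 4096 partitions


def node_order(pid, nodes):
    """Deterministic per-partition node ordering (hypothetical stand-in)."""
    return sorted(nodes, key=lambda n: hashlib.sha256(f"{pid}:{n}".encode()).hexdigest())


def replica_for(pid, rack_of):
    """Replica target for a partition under a given view of the rack layout."""
    order = node_order(pid, list(rack_of))
    master = order[0]
    for node in order[1:]:
        if rack_of[node] != rack_of[master]:
            return node
    return order[1]  # no other rack known: the next node in the order


# View of a restarted node: two racks are configured.
restarted_view = {"A": 1, "B": 1, "C": 2, "D": 2}
# View of a node that has not been restarted yet: no rack information.
unrestarted_view = {name: 0 for name in restarted_view}

disagreements = sum(
    replica_for(pid, restarted_view) != replica_for(pid, unrestarted_view)
    for pid in range(PARTITIONS)
)
print(f"{disagreements} of {PARTITIONS} partitions get a different replica target")
```

In this model the two views disagree whenever the second node in a partition's ordering shares a rack with the master, which for this layout would be expected for roughly one partition in three.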

@kporter,

So in the downtime approach, is there a step 0) to bring down all the servers first? If I am restarting servers in step 3), is it safe for some nodes to think they are in a rack while others don’t?

Updated the procedure to clarify.
