Speed up re-joining a cluster

Hi,

During a rolling restart, the restarted node comes up but then takes between 10 and 15 minutes to rejoin the cluster.

Is it normal for it to take this long to rejoin the cluster? How can I improve the time?

I am running Enterprise (4.5.0.5, in the middle of a rolling upgrade to 4.8.0.2), with clusters ranging from 10 to 16 nodes.

Does your deployment use Secondary Indexes?

Likely not. Is this something I can see with aql show indexes? If so, then the answer is no.

Do you have any persisted namespaces with data-in-memory true?

If so, these namespaces can take longer to load, since the data has to be loaded into RAM. We have made this a bit faster with “Cool Restart”, but it is still slower than a “Fast Restart” of a persistence-only namespace. Both “Cool” and “Fast” restarts require that the shared memory index is available (i.e. the node was shut down cleanly and the machine hasn’t been rebooted).

If you have rebooted the machines, then they will need a “Cold Restart”, which must rebuild the primary index from disk and, if the namespace is data-in-memory, also load the data into memory.
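To make the three cases easier to keep apart, here is a small Python sketch of the decision as described in this thread. The function name and its arguments are made up for illustration; it only encodes the rules stated above and is not how the server itself decides.

```python
def restart_type(clean_shutdown, machine_rebooted, data_in_memory):
    """Classify the expected restart for a persisted namespace.

    Hypothetical helper: it mirrors only the rules given in this thread.
    """
    # Fast and Cool restarts both need the shared memory index, which
    # survives only a clean shutdown without a machine reboot.
    if not clean_shutdown or machine_rebooted:
        return "cold"   # primary index must be rebuilt from disk
    # Index is available; data-in-memory namespaces still reload data into RAM.
    return "cool" if data_in_memory else "fast"
```

For example, restart_type(True, False, True) gives "cool", while any call with machine_rebooted=True gives "cold".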

If neither of these applies, could you share your configuration? You may also want to reach out to your enterprise support contact to ensure a timely response.

During the rolling restart, can you confirm that the node was shut down gracefully? The last log line before the server restarted should have been:

finished clean shutdown - exiting

Otherwise, the shutdown was not graceful and a cold restart would occur.
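If you want to check this across many nodes, something like the following Python sketch could help. It assumes you have already collected the log lines written before the restart (how you slice the log is up to you); the function name is made up, and the only string taken from the server is the message quoted above.

```python
CLEAN_SHUTDOWN_MARKER = "finished clean shutdown - exiting"

def was_clean_shutdown(lines_before_restart):
    """Return True if the last non-blank line contains the clean-shutdown message.

    Hypothetical helper: the caller supplies only the log lines that were
    written before the server came back up.
    """
    for line in reversed(list(lines_before_restart)):
        if line.strip():
            # Only the final non-blank line counts.
            return CLEAN_SHUTDOWN_MARKER in line
    return False  # no lines at all, so nothing confirms a clean shutdown
```

A False result here would point to a cold restart as the reason for the slow rejoin.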

Aerospike Support helped me. The fabric subcontext didn’t have the address config set, so fabric was listening on multiple interfaces. Once I configured it to listen only on the default interface, rejoining the cluster now takes only about 1 minute.
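For anyone who wants to check their own nodes, here is a rough Python sketch that reports whether address is set under network / fabric in a config file's text. The function name, the sample config, and the 10.0.0.5 address are placeholders of mine, not my real settings. The parser is deliberately simple: it assumes one directive per line and that each context opens with a trailing { and closes with a } on its own line.

```python
def fabric_addresses(conf_text):
    """Return the address values set inside network { fabric { ... } }.

    Hypothetical helper with a simplified view of the config syntax.
    """
    stack = []   # names of the contexts currently open
    found = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        if line.endswith("{"):
            stack.append(line[:-1].strip())   # entering a context
        elif line == "}":
            if stack:
                stack.pop()                   # leaving a context
        elif stack == ["network", "fabric"]:
            parts = line.split()
            if parts[0] == "address" and len(parts) > 1:
                found.append(parts[1])
    return found

# Placeholder config; the address is an example value only.
SAMPLE_CONF = """
network {
    fabric {
        address 10.0.0.5   # example address
        port 3001
    }
}
"""

print(fabric_addresses(SAMPLE_CONF))
```

An empty list means no address is set for fabric, which was the situation on my nodes before the fix.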

Incidentally, in the Configuration Reference I don’t see address listed with context network and subcontext fabric. Should I?

I think you should indeed… let us address that. Thanks for pointing it out.