One node in cluster is going in maintenance. Please suggest approaches to avoid any downtime

piyush123 · March 26, 2018, 1:03pm

I am running a 3-node cluster in production, using Aerospike Community Edition (3.15.0.1).

Current config of one node

# Aerospike database configuration file for deployments using mesh heartbeats.

service {
       user root
       group root
       paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
       pidfile /var/run/aerospike/asd.pid
       proto-fd-max 15000
}

logging {
       # Log file must be an absolute path.
       file /var/log/aerospike/aerospike.log {
               context any info
       }
}

network {
       service {
               address any
               port 3000
       }

       heartbeat {
               mode mesh
               port 3002 # Heartbeat port for this node.

               # List one or more other nodes, one ip-address & port per line:
               mesh-seed-address-port BOX1 3002
               mesh-seed-address-port BOX2 3002
               mesh-seed-address-port BOX3 3002

               interval 250
               timeout 10
       }

       fabric {
               port 3001
       }

       info {
               port 3003
       }
}



namespace XXXXXX {
       replication-factor 2
       memory-size 40G
       default-ttl 0
       storage-engine device {
               device /dev/xvdb
               write-block-size 1024K
               data-in-memory true
       }
}

One node of my cluster is going under maintenance and needs to be restarted.

What is the best approach to handle it so that my app doesn’t face any downtime?

I was planning to add one new node to the cluster and then remove the node that will be going under maintenance. If I choose to do this, do I need to restart the other nodes in the cluster as well, since I have listed the IPs of all nodes in each node’s config?

Or should I stick with the node that will be going under maintenance and just restart it, keeping it part of the cluster?

Albot · March 27, 2018, 12:48am

If one node goes down, you won’t face downtime. If you do not have the capacity to lose a node (that is, if memory or disk usage multiplied by (1/(n-1) + 1), where n is the number of nodes, would exceed your max pct), then yes, you would want to add one before taking one out.

You don’t need to restart all the nodes for mesh to work with a new IP, or even if you have an old IP lying around. You can leave in a defunct IP: the heartbeat to that particular node will fail while it’s undergoing maintenance, and that’s fine. You can also use tip-clear (an asinfo command) to remove it if you’d like. When you introduce the new node, as long as it has one of the online servers in its mesh configuration, it will join the cluster and all nodes will discover each other through the gossip protocol.

https://www.aerospike.com/docs/architecture/clustering.html
https://www.aerospike.com/docs/reference/info/#tip
https://www.aerospike.com/docs/reference/info/#tip-clear
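To make the capacity check and the tip/tip-clear commands above concrete, here is a minimal Python sketch. It reads the factor in the post as 1/(n-1) + 1 (equivalently n/(n-1)) and assumes data is spread evenly across nodes. The function names, the example percentages, and the 60% limit are illustrative assumptions, not values from this thread; substitute whichever threshold you actually configure. The helpers only build asinfo command strings; they do not run anything against a cluster.

```python
def usage_after_node_loss(nodes: int, usage_pct: float) -> float:
    """Projected per-node usage (%) after one of `nodes` evenly loaded nodes leaves.

    Assumption: the same data is redistributed over the remaining nodes,
    so usage grows by 1/(n-1) + 1, which equals n/(n-1).
    """
    if nodes < 2:
        raise ValueError("need at least 2 nodes to be able to lose one")
    return usage_pct * nodes / (nodes - 1)


def can_lose_node(nodes: int, usage_pct: float, limit_pct: float) -> bool:
    """True if the projected usage stays at or below the chosen limit."""
    return usage_after_node_loss(nodes, usage_pct) <= limit_pct


def tip_command(host: str, port: int = 3002) -> str:
    """asinfo command asking a running node to heartbeat with host:port."""
    return f"asinfo -v 'tip:host={host};port={port}'"


def tip_clear_command(host: str, port: int = 3002) -> str:
    """asinfo command removing host:port from a node's mesh seed list."""
    return f"asinfo -v 'tip-clear:host-port-list={host}:{port}'"


if __name__ == "__main__":
    # Hypothetical numbers: 3 nodes, each at 35% memory, 60% chosen as the limit.
    print(usage_after_node_loss(3, 35.0))
    print(can_lose_node(3, 35.0, 60.0))
    # BOX3 is the placeholder hostname used in the config above.
    print(tip_clear_command("BOX3"))
```

Run the check separately for memory and for disk, since either one can be the limiting resource.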

piyush123 · March 27, 2018, 10:58am

In my app I have listed the IPs of all nodes in the cluster. Would that create an issue if I remove one box now? Would the Aerospike client be able to handle it?

kporter · March 27, 2018, 3:17pm

Yes, removing a node is fine. So is adding a node, but that is likely unnecessary.
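As a rough illustration of why the client copes with this: the host list in the client config acts as a set of seeds, and the client learns the rest of the cluster from whichever seed it can reach. The sketch below assumes the Aerospike Python client (the `aerospike` package); the helper names are hypothetical, and BOX1 to BOX3 are the placeholder hostnames from the config above.

```python
def build_client_config(seed_hosts, port=3000):
    """Build a client config dict from a list of seed hostnames.

    Listing several seeds means the client can still bootstrap
    when one of them is down for maintenance.
    """
    if not seed_hosts:
        raise ValueError("at least one seed host is required")
    return {"hosts": [(host, port) for host in seed_hosts]}


def connect(config):
    """Connect using the Aerospike Python client (third-party package).

    The import is kept local so that building the config does not
    require the client library to be installed.
    """
    import aerospike
    return aerospike.client(config).connect()


# Example: all three nodes listed as seeds, service port 3000 as in the config.
config = build_client_config(["BOX1", "BOX2", "BOX3"])
```

A stale entry for the node under maintenance can stay in the list, as long as at least one other seed is reachable when the application starts.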

I noticed your config is missing the final brace ‘}’. Is that a copy-paste error?

piyush123 · March 28, 2018, 4:03am

Yes, that’s a copy-paste error. Thanks for your help.
