One node in the cluster is going into maintenance. Please suggest approaches to avoid any downtime.


#1

I am running a 3-node cluster in production, using Aerospike Community Edition (3.15.0.1).

Current config of one node:

# Aerospike database configuration file for deployments using mesh heartbeats.

service {
       user root
       group root
       paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
       pidfile /var/run/aerospike/asd.pid
       proto-fd-max 15000
}

logging {
       # Log file must be an absolute path.
       file /var/log/aerospike/aerospike.log {
               context any info
       }
}

network {
       service {
               address any
               port 3000
       }

       heartbeat {
               mode mesh
               port 3002 # Heartbeat port for this node.

               # List one or more other nodes, one ip-address & port per line:
               mesh-seed-address-port BOX1 3002
               mesh-seed-address-port BOX2 3002
               mesh-seed-address-port BOX3 3002

               interval 250
               timeout 10
       }

       fabric {
               port 3001
       }

       info {
               port 3003
       }
}



namespace XXXXXX {
       replication-factor 2
       memory-size 40G
       default-ttl 0
       storage-engine device {
               device /dev/xvdb
               write-block-size 1024K
               data-in-memory true
       }
}

One node of my cluster is going under maintenance and needs to be restarted.

What is the best approach to handle it so that my app doesn’t face any downtime?

I was planning to add one new node to the cluster and then remove the node that will be going under maintenance. If I choose to do this, do I need to restart the other nodes in the cluster as well, since I have listed the IPs of all nodes in the config of each node?

Or should I stick with the node that will be going under maintenance and just restart it, keeping it part of the cluster?


#2

If one node goes down, you won’t face downtime. If you do not have the capacity to lose a node (that is, if your current memory or disk usage multiplied by (1/(n-1) + 1), where n is the number of nodes, would exceed your maximum percentage), then yes, you would want to add one before taking one out.

You don’t need to restart all the nodes for mesh to work with a new IP, or even if you have an old IP lying around. You can leave in a defunct IP; the heartbeat to that particular node will fail while it’s undergoing maintenance, and that’s fine. You can also use tip-clear (an asinfo command) to remove it if you’d like. When you introduce the new node, as long as it has one of the online servers in its mesh configuration, it will join the cluster and all nodes will discover each other through the gossip protocol.

https://www.aerospike.com/docs/architecture/clustering.html
https://www.aerospike.com/docs/reference/info/#tip
https://www.aerospike.com/docs/reference/info/#tip-clear
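
To make the capacity rule above concrete, here is a rough Python sketch of the arithmetic. It assumes data is spread evenly across nodes, so removing one node out of n scales per-node usage by (1/(n-1) + 1). The function names and the 60% limit in the usage note are illustrative assumptions, not values taken from the config in this thread; substitute whatever maximum percentage applies to your namespace.

```python
def usage_after_node_loss(current_pct: float, nodes: int) -> float:
    """Estimate per-node usage (%) after removing one of `nodes` nodes.

    Assumes data is balanced evenly, so the same data lands on
    nodes - 1 machines and usage grows by a factor of (1/(n-1) + 1).
    """
    if nodes < 2:
        raise ValueError("need at least 2 nodes to remove one")
    return current_pct * (1.0 / (nodes - 1) + 1.0)


def can_lose_node(current_pct: float, nodes: int, max_pct: float) -> bool:
    """True if projected usage stays at or below max_pct (a hypothetical limit)."""
    return usage_after_node_loss(current_pct, nodes) <= max_pct
```

For example, on a 3-node cluster at 35% usage, `usage_after_node_loss(35, 3)` gives 52.5, so with an assumed limit of 60% one node can be taken out without adding another first. At 45% usage the projection is 67.5, which would exceed that limit, so adding a node beforehand would be the safer route.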


#3

In my app I have listed all the IPs of the cluster. Would that create an issue if I remove one box now? Would the Aerospike client be able to handle it?


#4

Yes, removing a node is fine. Adding a node is fine too, but likely unnecessary.

I noticed your config is missing the final brace ‘}’. Is that a copy-paste error?


#5

Yes, that’s a copy-paste error. Thanks for your help.