We have multiple deployments of 3-node Aerospike clusters. In one particular instance, the resync at midnight UTC is taking place at 6pm local time, during heavy load, and causing the cluster to splinter. This, in turn, causes high CPU load as the cluster continues trying to resync, impacting application performance for a time.
Is there a way to reschedule this sync so it runs at, say, 6am UTC instead of midnight? Perusing the documentation, I’ve found no such configuration options, nor any way to internally list or adjust built-in database tasks.
3.6.3 is the version included with this particular software stack, running on CentOS 6.5. 4GB RAM, 2GB currently assigned to Aerospike.
Hm, we were informed that at midnight UTC, it will check and resync the cluster, causing a brief spike in CPU load.
Rebooting the principal server helped for a couple of days as described here, but after that the cluster splintered and couldn’t recover for nearly three hours due to suddenly high CPU load as seen when resyncing.
One of the details is that this CPU load spike occurs at 00:01:57 UTC, followed immediately by the cluster exploding, which is why it seemed to be a scheduled task within Aerospike.
Could you also provide the log lines containing “nsup-start” or “nsup-done”?
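In case it helps with pulling those lines out, here is a minimal Python sketch that filters a log for the two markers. The function names and the log path in the comment are assumptions for illustration (your install may log elsewhere); the only thing taken from this thread is the pair of strings being searched for.

```python
from typing import Iterable, List

# The two markers requested in this thread.
NSUP_MARKERS = ("nsup-start", "nsup-done")


def filter_nsup_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that contain either nsup marker, newline stripped."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line for marker in NSUP_MARKERS)
    ]


def print_nsup_lines(path: str) -> None:
    """Print matching lines from a log file.

    The path is supplied by the caller, for example
    /var/log/aerospike/aerospike.log (an assumed location, check your config).
    """
    with open(path, errors="replace") as fh:
        for hit in filter_nsup_lines(fh):
            print(hit)
```

Calling `print_nsup_lines(...)` with the path of each node's log should show whether an nsup cycle starts close to 00:01:57 UTC.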
There isn’t any task scheduled at midnight UTC, but that time may happen to coincide with an nsup start (though this isn’t supposed to cause a rebalance). Rebalances occur when there has been a network disruption; this includes adding or removing a node, as well as one or more nodes becoming unavailable for a heartbeat timeout period.
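To give a rough sense of how short a "heartbeat timeout period" can be, here is a small sketch of the arithmetic. It assumes the window is the heartbeat `interval` (in milliseconds) multiplied by the heartbeat `timeout` (a count of missed intervals), and it uses 150 and 10 as example values; please check the heartbeat stanza of your own aerospike.conf rather than trusting these numbers.

```python
def heartbeat_timeout_ms(interval_ms: int, timeout_intervals: int) -> int:
    """Time a node can stay silent before peers treat it as gone.

    Assumption: window = interval (ms) * timeout (number of missed intervals).
    """
    return interval_ms * timeout_intervals


# Example values only (assumed, not read from this cluster's configuration).
window = heartbeat_timeout_ms(interval_ms=150, timeout_intervals=10)
print(f"node considered unavailable after ~{window} ms of silence")
```

With values in that range, a node that is starved of CPU for a second or two could miss enough heartbeats to be dropped, which would fit a cluster splitting under heavy load without any scheduled task being involved.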
BTW, the latency during rebalance was significantly improved with Aerospike 3.11.1 and the clustering algorithms became more robust with Aerospike 3.13.0.1 after switching to paxos-protocol v5.