CPU loading

Hi, everyone. We are using the C# client, version 3.9.12. Everything is hosted in Kubernetes (both the Aerospike cluster and the services that work with it). When a node goes down, we have the following problems:

  1. A big CPU increase on all services that work with Aerospike
  2. For some time, requests continue to be sent to the node that is down. We know that this node proxies these requests to other nodes, but this is difficult to control in situations where we want to do, for example, maintenance.

Could somebody help us solve these problems?

Can you describe the node ‘going down’? Is this because of some issue, or planned maintenance? What are your heartbeat settings (server) and tend interval (client) set to?
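For reference, this is roughly what the relevant server-side heartbeat settings look like in aerospike.conf. The values shown are the commonly documented defaults, not your actual settings - verify against the docs for your server version:

```
heartbeat {
    mode mesh       # mesh is the typical mode for Kubernetes deployments
    interval 150    # ms between heartbeats (default)
    timeout 10      # missed heartbeats before a node is considered gone (default)
}
```

With these defaults, the detection window is roughly interval × timeout ≈ 1.5 s before the remaining nodes agree the node has left. On the client side, the tend interval (how often the client refreshes the partition map) defaults to 1000 ms in the C# client’s ClientPolicy.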

Hi. It happened during planned maintenance. We used the defaults for heartbeat and tend interval, without any overrides. Could you advise a better configuration for this, or something else?

Is the replication factor 2 or greater for all namespaces? Can you quantify the ‘big CPU increase’? For our use case we use Enterprise Edition, which allows quiesce. With quiesce you can gracefully remove a node without causing errors to clients. Without quiesce, I think some period of errors is normal until the heartbeat detects the node is gone and the tend picks up the new partition map - but the high CPU usage is interesting. Have you profiled it? Is it high as in a ‘50% increase on a 1 vCPU container’, or high as in ‘64 saturated cores’? I’m curious whether the high CPU load is driven by application design rather than Aerospike.
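For what it’s worth, on Enterprise Edition a graceful removal with quiesce typically looks something like the following sketch (check the docs for your server version before relying on it):

```
# On the node being taken down: stop taking master reads/writes
asinfo -v 'quiesce:'

# On any node: trigger a recluster so the partition map is rebalanced
# and clients stop sending traffic to the quiesced node
asinfo -v 'recluster:'

# Once migrations are complete, stop the quiesced node safely
# (e.g. stop the aerospike service, or delete the pod in Kubernetes)
```

Done in this order, clients move off the node before it actually stops, so there is no window of failed transactions for them to retry against.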

Hello. The replication factor is 2 for all namespaces, and we remove the node gracefully. The result is the following: CPU usage climbs to 100% (from an initial 30 - 40%) and stays there until the service is restarted. We also see that for 15 minutes or more, requests continue to be sent to this node, which proxies them.

Hard to guess much more from these symptoms without logs. But as @Albot mentioned, for smooth maintenance one should make use of the quiescence feature; otherwise, depending on the network heartbeat settings, it can take some time for a node to be recognized as having left the cluster - long enough for clients to try to compensate for higher latencies / failed transactions, which would cause a surge of connections (hence the CPU).
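On the client side, one knob worth checking is how aggressively the client retries during the window before the partition map updates. A minimal sketch for the 3.x C# client - field names such as `tendInterval`, `maxRetries`, and `sleepBetweenRetries` should be verified against your exact client version, and the host name here is a placeholder:

```
using Aerospike.Client;

ClientPolicy clientPolicy = new ClientPolicy();
clientPolicy.tendInterval = 1000;  // ms; how often the client refreshes the partition map

// "aerospike.default.svc" is a hypothetical Kubernetes service name
AerospikeClient client = new AerospikeClient(clientPolicy, "aerospike.default.svc", 3000);

// Bound retries so a departing node does not trigger a retry/connection storm
Policy readPolicy = new Policy();
readPolicy.maxRetries = 2;             // limit retransmits per transaction
readPolicy.sleepBetweenRetries = 100;  // ms back-off between retries
```

Unbounded or tight retry loops against a node that is going away are a common source of exactly this kind of connection churn and CPU surge.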


A CPU increase can have different root causes, depending on the configuration… it could be caused by connection churn due to clients having to compensate and retry when failing against the node that is going down. It could also be driven by migrations starting. The logs would have details that can help narrow it down.