Aerospike Server connections vs CPU cores

Hello everyone, I’m not sure if this is the right place to post this question. I’m trying to understand how the Aerospike server works under the hood.

We are on Aerospike 7.2, and our Java client is configured to keep a minimum of 50 connections open to each AS node. These connection pools are mostly underutilized, but we want connections ready in case of a burst of traffic. What we have noticed is that some AS nodes are often slower, and when we dynamically lowered the client timeout from 1000ms to 100ms, some connections were closed and new ones opened. After that, the affected AS node started responding faster. Please note that we only use synchronous Aerospike get operations; there are no writes in these metrics.
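For context, the client setup looks roughly like this; a minimal sketch using the standard Aerospike Java client, with the host name and exact values as placeholders:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Host;
import com.aerospike.client.policy.ClientPolicy;
import com.aerospike.client.policy.Policy;

public class ClientSetup {
    public static AerospikeClient connect() {
        ClientPolicy clientPolicy = new ClientPolicy();
        // Keep at least 50 connections open per node so a traffic burst
        // does not pay the cost of opening new sockets.
        clientPolicy.minConnsPerNode = 50;

        // Default read policy: total transaction timeout in milliseconds.
        // This is the value we changed at runtime from 1000 to 100.
        Policy readPolicy = new Policy();
        readPolicy.totalTimeout = 1000;
        clientPolicy.readPolicyDefault = readPolicy;

        // "as-node-1" is a placeholder host name.
        return new AerospikeClient(clientPolicy, new Host("as-node-1", 3000));
    }
}
```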

As shown in the Grafana graph, at 20:10 we decreased the timeout from 1000ms to 100ms and the AS node C10 had some timeouts. After that, it started performing better than before, even after we reverted the timeout back to 1000ms.

CPU utilization for the C10 node is also shown on the last graph.

I’m trying to understand how Aerospike works under the hood, and why this might be happening. Any help is appreciated.

What are you measuring for latency? It’s likely that dropping the timeout just caps the tail latency being reported, which skews the reported latency down (ten 1000ms calls plus thousands of 1ms calls vs. ten 100ms calls plus thousands of 1ms calls). I would be suspicious of hot keys and workload distribution more than anything.
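To put rough numbers on that reporting skew (the call counts here are purely illustrative, not measured):

```java
public class TailSkew {
    public static void main(String[] args) {
        // Illustrative mix: a handful of calls pinned at the timeout,
        // plus a large number of fast 1ms calls.
        double slowCalls = 10, fastCalls = 1000;

        double meanAt1000ms = (slowCalls * 1000 + fastCalls * 1) / (slowCalls + fastCalls);
        double meanAt100ms  = (slowCalls * 100  + fastCalls * 1) / (slowCalls + fastCalls);

        // ~10.9 ms vs ~2.0 ms: the average drops roughly 5x even though
        // the fast calls, i.e. the real workload, did not change at all.
        System.out.printf("mean with 1000ms timeout: %.1f ms%n", meanAt1000ms);
        System.out.printf("mean with 100ms timeout:  %.1f ms%n", meanAt100ms);
    }
}
```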

The CPU usage graph is baffling and makes me further suspect hot/warm keys - or very large keys mixed in with small ones. Are these all the same machines? What kind of infra/deployment is this?

We are using a Micrometer Timer to measure the operate call; we use operate to pick just the bins, and the parts of bins, that we need. When we upgrade to server 8.1, maybe lazy bin loading will help here. We are also looking into the record values and trying to find large records that we might be fetching often. The biggest record in the namespace is 1.03MB; I’ve attached the complete histogram, but I can’t tell from it how frequently records in each histogram bin are fetched.
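For reference, the measurement looks roughly like this; a minimal sketch assuming a shared MeterRegistry, with the timer name, key, and bin names as placeholders:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Operation;
import com.aerospike.client.Record;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class TimedRead {
    private final AerospikeClient client;
    private final Timer operateTimer;

    public TimedRead(AerospikeClient client, MeterRegistry registry) {
        this.client = client;
        // "aerospike.operate" is a placeholder timer name.
        this.operateTimer = registry.timer("aerospike.operate");
    }

    public Record readBins(Key key) {
        // Time the synchronous operate() call that reads only the bins we need.
        return operateTimer.record(() ->
                client.operate(null, key, Operation.get("bin1"), Operation.get("bin2")));
    }
}
```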

The interesting thing about our experiment is that when we reverted the AS timeout back to 1000ms, the issue didn’t reappear; it’s almost as if having it at 100ms for a couple of minutes (maybe even seconds) fixed it somehow. After about 5-6 hours the same node C10 slowly increased in latency until it got back to the same level as before the timeout change, while the other AS nodes remained pretty much flat during that period.

As for our infra, we have 3 machines in 3 racks. Each machine has 2 CPUs (96 cores each) and runs 2 AS server processes; each process is pinned to a single CPU, has its own NIC, and has NUMA nodes configured for memory locality. We have also applied all of the best practices found in the AS documentation. Clients are rack-aware and we have replication factor 2, meaning there is a master record and one replica for each entry, so each rack holds only 66.6% of the namespace. In the graph I posted, A10-A31 is the first rack, B is the second, and C is the third. Our services are deployed in the same racks, orchestrated with Kubernetes.
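For completeness, the rack-aware part of the client setup looks roughly like this; a minimal sketch using the standard Aerospike Java client, with the rack id as a placeholder:

```java
import com.aerospike.client.policy.ClientPolicy;
import com.aerospike.client.policy.Policy;
import com.aerospike.client.policy.Replica;

public class RackAwareConfig {
    public static ClientPolicy build(int localRackId) {
        ClientPolicy clientPolicy = new ClientPolicy();
        // Tell the client which rack it lives in so reads can stay local.
        clientPolicy.rackAware = true;
        clientPolicy.rackId = localRackId;

        // Prefer the replica (master or copy) that sits in our own rack;
        // with replication factor 2, ~66.6% of the namespace is local.
        Policy readPolicy = new Policy();
        readPolicy.replica = Replica.PREFER_RACK;
        clientPolicy.readPolicyDefault = readPolicy;

        return clientPolicy;
    }
}
```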

One thing that is suspicious to me is the fact that we have over-provisioned the number of connections in our AS client configuration to 50 (for bursts of traffic). When inspecting the source code of the AS client, I see that the connection pool behaves like a stack, meaning that most of the time only a few connections from the top of the stack are used while all the others remain open but unused. This, combined with the fact that timing out some connections and opening new ones fixes the latency issue, makes me think that AS client connections are somehow bound to AS server CPU cores, and that closing connections and creating new ones redistributes the load across the server’s CPU cores.
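This is not the actual client source, just a simplified sketch of the LIFO behaviour described above: if connections are taken from and returned to the top of a stack, a low-concurrency steady load keeps reusing the same few sockets while the rest of the pool stays idle.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified illustration of a LIFO (stack-like) connection pool.
// A real client pool also handles timeouts, trimming, and thread safety.
public class LifoPoolSketch {
    private final Deque<String> pool = new ArrayDeque<>();

    public LifoPoolSketch(int size) {
        for (int i = 0; i < size; i++) {
            pool.push("conn-" + i); // pre-opened connections, newest on top
        }
    }

    public String borrow() {
        return pool.pop();   // always take from the top of the stack
    }

    public void release(String conn) {
        pool.push(conn);     // and return to the top
    }

    public static void main(String[] args) {
        LifoPoolSketch sketch = new LifoPoolSketch(50);
        // With one in-flight request at a time, every borrow/release cycle
        // hits the same connection; the other 49 are never touched.
        for (int i = 0; i < 5; i++) {
            String conn = sketch.borrow();
            System.out.println("request " + i + " used " + conn);
            sketch.release(conn);
        }
    }
}
```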
