Dual CPU Socket and Aerospike

Our Aerospike nodes have two CPUs (full dual socket, not dual core), and I would like to know the best way to configure Aerospike. My assumption is that cross-socket communication will be slow, and that running multiple asd processes (one per socket) is going to be the optimal approach (which was confirmed at the user summit, thanks Andy!).

First thing to note is the OS: we are using CentOS 7.5 (kernel 3.10.0-862), which means we do not have access to CPU pinning via https://www.aerospike.com/docs/reference/configuration/#auto-pin - and I don't think that would really help anyway, as it would leave a single CPU idle.

So what is the recommended approach here? Something like numactl --cpunodebind=0 --membind=0 asd? And what is the recommended approach for dealing with ports?
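To make that concrete, here is roughly what I have in mind - two configs with non-overlapping ports. The port offsets and file names are just my guesses:

    # /etc/aerospike/aerospike-node0.conf (network fragment)
    network {
        service {
            port 3000
        }
        fabric {
            port 3001
        }
        heartbeat {
            mode mesh
            port 3002
        }
        info {
            port 3003
        }
    }

    # aerospike-node1.conf: same, but ports 3010-3013

    numactl --cpunodebind=0 --membind=0 asd --config-file /etc/aerospike/aerospike-node0.conf
    numactl --cpunodebind=1 --membind=1 asd --config-file /etc/aerospike/aerospike-node1.conf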

Any advice would be appreciated.

It might not be worth it to go this far with tuning… Can you tell us more about your setup? What hardware do you have? What are your namespace configurations like? Are you using disk-backed namespaces? What's your latency currently at? I've not heard of anyone running numactl to try to emulate this, but I'm curious to see where you land.

We have a few different clusters, but primarily we are running several disk-backed namespaces, and we are generally constrained by RAM rather than disk (i.e., high record count). Clusters range in size from 3 to 20 nodes, and all are on Dell hardware (R730s) with dual sockets. Latency is generally fine; the thing that is hurting us is inconsistent latency. We are using Samsung SM863s at the moment, which are decently fast, but not especially consistent.

It may not make a huge difference, but seems worth investigating to me.

In my experience, some drives can be notorious for 'taking a break' every once in a while. Have you run ACT on them and done a capacity planning exercise? If you enable micro-benchmarks, I would be very surprised if it wasn't related to your disk or network… or maybe even just check your avgqu-sz during those peaks, if you have some way to monitor that and correlate it.

Yep, I thrashed them with ACT, and there are definitely periods of latency with these drives. I know we can do better there, and we will with the next round of hardware. But for now I'd like to get the most out of the existing hardware.

CPU pinning really comes in handy when you're pushing the envelope of the CPU/network, typically on in-memory configurations with ridiculous amounts of TPS. I've actually seen auto-pin degrade performance on a disk-backed solution. I forget the reason the support folks gave me for this. @kporter might know? I don't mean to be a downer, but I don't think it will improve your latencies by any statistically significant amount. That being said, if you just want to try it out for fun, as I know I would (haha), I'd test auto-pin in a test environment where you can simulate your workload, and just use Ubuntu instead of trying to use numactl and other tricks to bind things. That way, if you can get it working the 'right' and 'documented' way and see improvements, maybe it's worth chasing with numactl? Just my thoughts though!

Ooooh, interesting. Let me join the conversation, guys.

Maybe first a little bit of background on the kernel version requirement. For our pinning mechanism (CPU as well as NUMA) we need to be able to ask the kernel which CPU it uses to handle a given client TCP connection. We'll then make sure that we process transactions from that connection on the same CPU. This keeps all processing local to one CPU, end to end. Unfortunately, only Linux versions 3.19 and later allow you to ask the kernel which CPU it uses to handle a TCP connection.

Now, let me talk a little bit about the functionality behind pinning. CPU pinning first:

  • It uses all available CPUs across all NUMA nodes.
  • It reconfigures the network interface(s) to evenly distribute network packet and TCP/IP processing across all CPUs. This exploits things like RSS and the CPU affinities of network queue interrupts. That's why pinning is not compatible with irqbalance. (A minimal config sketch follows this list.)
  • It tells Aerospike to process transactions on the same CPU that did the network packet and TCP/IP processing.
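For reference - and for anyone on a new enough kernel - enabling CPU pinning is a one-liner in the service context, plus making sure irqbalance stays out of the way. A minimal sketch, config fragment only:

    # aerospike.conf (fragment)
    service {
        auto-pin cpu
    }

    # irqbalance would fight the interrupt affinities that auto-pin sets up
    systemctl stop irqbalance
    systemctl disable irqbalance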

Here's how NUMA pinning differs from that:

  • It tells Aerospike to only use the CPUs of one NUMA node. The NUMA node is selected with the --instance command line option. That's why you have to run multiple asd processes - one per NUMA node. (A run sketch follows this list.)
  • It migrates any shared memory (primary indexes) to the correct NUMA node, if necessary. Suppose that you ran without NUMA pinning first and then switched to NUMA pinning. Running without NUMA pinning would have allocated shared memory pretty evenly across all NUMA nodes. So, you'd end up with primary index accesses that go to the wrong NUMA node(s) - unless you migrated the shared memory to the correct NUMA node.
  • Just like CPU pinning, it also reconfigures the network interface(s) to make sure that we stay on a single CPU end to end.
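To illustrate the multi-process part, a sketch under the assumption of two NUMA nodes; the config file names are placeholders:

    # in each instance's config file (fragment)
    service {
        auto-pin numa
    }

    # one asd per NUMA node, each with its own config and instance number
    asd --config-file /etc/aerospike/numa0.conf --instance 0
    asd --config-file /etc/aerospike/numa1.conf --instance 1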

Now for numactl. Yes, you can mimic some of the above functionality with it:

  • You can ensure that the asd threads all run on one NUMA node and that memory gets allocated from one NUMA node.
  • You cannot ensure that transactions are processed on a single CPU end to end. We don't know which CPU the kernel used to process network packets and to run the TCP/IP stack. That's only possible with kernel 3.19 and later.
  • You won't get automatic migration of shared memory. And I'd have to check whether using migratepages on top of numactl would help, i.e., whether migratepages works with shared memory. (A tentative command is sketched below.)
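If you want to experiment with migratepages (it ships with the numactl package), the invocation would look something like this - but again, I haven't verified that it moves Aerospike's shared memory segments:

    # move the pages of a single asd process from NUMA node 1 to node 0
    # (untested for the shared-memory primary index; assumes one asd process)
    migratepages $(pgrep -x asd) 1 0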

So, if you want to run a numactl experiment, then I'd recommend doing this:

  • If you have a primary index around that was created by running asd without numactl, get rid of it, e.g., by rebooting the machine to remove the shared memory segments (or manually destroy them using ipcs and ipcrm).
  • Run asd under numactl as you proposed (see the command sketch below).
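In commands, roughly - the shmid values come from the ipcs output, and node 0 is chosen arbitrarily:

    # list SysV shared memory segments left over from a previous asd run
    ipcs -m
    # remove each stale segment by its shmid
    ipcrm -m <shmid>
    # start asd with CPUs and memory both bound to NUMA node 0
    numactl --cpunodebind=0 --membind=0 asd --config-file /etc/aerospike/aerospike.conf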

One general piece of advice when running multiple asd processes on a single machine: Use rack-awareness. Suppose you have a replication factor of 2 as well as 2 NUMA nodes per machine. Without rack-awareness, both copies of a record could end up on the same machine. Turn each machine into a rack, so that both asd processes on a machine "are in the same rack." Rack-awareness now ensures that the two copies of a record are in different racks, i.e., on different machines.
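In config terms, that's the per-namespace rack-id setting - the same value for both asd processes on a machine, a different value on each machine. A fragment with placeholder values:

    namespace test {
        replication-factor 2
        # both asd processes on machine 1 use rack-id 1,
        # both processes on machine 2 use rack-id 2, and so on
        rack-id 1
    }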

This first incarnation of NUMA pinning focuses on super-high-TPS in-memory workloads whose performance is limited by RAM latency. In particular, it only reconfigures network interfaces. It doesn't currently touch anything storage-related. When you have SSD-backed namespaces, it can currently happen that performance actually degrades, e.g., because irqbalance needs to be turned off. I think that this might be what Albot observed.

All in all, yes, numactl is a partial replacement for the NUMA pinning feature. Be warned, however, that it may not do a lot for SSD-backed namespaces. But it might still be worth an experiment.


Oh, and once you have your asd processes up and running, maybe double-check their /proc/${ASD_PID}/numa_maps to make sure that memory's been allocated on the correct NUMA nodes.
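With multiple asd processes, check each PID. Something like this, assuming ASD_PID holds the PID in question (numastat also ships with the numactl package):

    # per-mapping node breakdown; look at the N<node>=<pages> counters
    grep -E 'N[0-9]+=' /proc/${ASD_PID}/numa_maps | head
    # or a per-node summary for the whole process
    numastat -p ${ASD_PID}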

Great info, thank you everyone.