Dual CPU Socket and Aerospike

Ooooh, interesting. Let me join your conversation, guys.

Maybe first a little bit of background on the kernel version requirement. For our pinning mechanism (CPU as well as NUMA) we need to be able to ask the kernel which CPU it uses to handle a given client TCP connection. We’ll then make sure that we process transactions from that connection on the same CPU. This keeps all processing local to one CPU, end to end. Unfortunately, only Linux versions 3.19 and later allow you to ask the kernel which CPU it uses to handle a TCP connection.
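For the curious, the kernel feature behind this is the SO_INCOMING_CPU socket option, which 3.19 introduced. Here's a minimal sketch of querying it for an accepted connection - not taken from our code base, and connection_cpu() is just a name made up for the example:

```c
/* Sketch: ask the kernel which CPU handled the RX/TCP work for an
 * accepted connection. Requires Linux 3.19+ (SO_INCOMING_CPU). */
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_CPU
#define SO_INCOMING_CPU 49   /* value from asm-generic/socket.h, in case the libc headers are too old */
#endif

/* Returns the CPU that processed this connection's packets, or -1 on error. */
int connection_cpu(int conn_fd)
{
    int cpu = -1;
    socklen_t len = sizeof(cpu);

    if (getsockopt(conn_fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) < 0) {
        perror("getsockopt(SO_INCOMING_CPU)");
        return -1;
    }

    return cpu;
}
```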

Now, let me talk a little bit about the functionality behind pinning. CPU pinning first:

  • It uses all available CPUs across all NUMA nodes.
  • It reconfigures the network interface(s) to evenly distribute network packet and TCP/IP processing across all CPUs. This exploits things like RSS (receive-side scaling) and the CPU affinities of the network queue interrupts. That’s why pinning is not compatible with irqbalance.
  • It tells Aerospike to process transactions on the same CPU that did the network packet and TCP/IP processing (see the sketch after this list).
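To make that last bullet concrete - this is not Aerospike's actual code, just the general idea: once you know which CPU the kernel used for a connection (via SO_INCOMING_CPU, as above), you pin the thread that processes that connection's transactions to the very same CPU.

```c
/* Sketch only: keep transaction processing on the CPU that did the
 * network/TCP work for this connection. Illustrative, not asd internals. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to the given CPU. Returns 0 on success. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Usage (with the hypothetical helper from the earlier sketch):
 *     int cpu = connection_cpu(conn_fd);
 *     if (cpu >= 0)
 *         pin_self_to_cpu(cpu);
 */
```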

Here’s how NUMA pinning differs from that:

  • It tells Aerospike to only use the CPUs of one NUMA node. The NUMA node is selected with the --instance command line option. That’s why you have to run multiple asd processes - one per NUMA node.
  • It migrates any shared memory (primary indexes) to the correct NUMA node, if necessary. Suppose you first ran without NUMA pinning and then switched it on. The earlier runs would have allocated shared memory pretty evenly from all NUMA nodes, so primary index accesses would keep going to the wrong NUMA node(s) unless the shared memory is migrated to the correct node (see the sketch after this list).
  • Just like CPU pinning, it also reconfigures the network interface(s) to make sure that we stay on a single CPU end to end.
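For illustration only - this isn't how asd does it internally, and migrate_mapping_to_node() is a made-up name - migrating an already-populated shared memory mapping to one NUMA node can in principle be done with mbind() plus MPOL_MF_MOVE:

```c
/* Sketch: bind an existing (e.g. shmat'ed) mapping to one NUMA node and
 * migrate its pages there. Illustrative only, not Aerospike's internals.
 * Compile with -lnuma; addr and len must be page-aligned. */
#include <numa.h>
#include <numaif.h>

int migrate_mapping_to_node(void *addr, size_t len, int node)
{
    struct bitmask *mask = numa_allocate_nodemask();

    numa_bitmask_setbit(mask, node);

    /* MPOL_BIND keeps future allocations on this node; MPOL_MF_MOVE asks
     * the kernel to move pages already allocated elsewhere. Pages that are
     * also mapped by other processes need MPOL_MF_MOVE_ALL instead, which
     * requires CAP_SYS_NICE. */
    int rc = mbind(addr, len, MPOL_BIND, mask->maskp, mask->size + 1,
                   MPOL_MF_MOVE);

    numa_free_nodemask(mask);
    return rc;
}
```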

Now for numactl. Yes, you can mimic some of the above functionality with it:

  • You can ensure that the asd threads all run on one NUMA node and that memory gets allocated from one NUMA node.
  • You cannot ensure that transactions are processed on a single CPU end to end. numactl has no way of knowing which CPU the kernel used to process the network packets and to run the TCP/IP stack. Finding that out is only possible with kernel 3.19 and later.
  • You won’t get automatic migration of shared memory. And I’d have to check whether using migratepages on top of numactl would help, i.e., whether migratepages works with shared memory. (The sketch below shows one way to check where the shared memory pages actually end up.)
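One way to check where the pages of an attached shared memory segment actually live - again, just a sketch, not something from our tree, and report_page_nodes() is a name invented for the example - is move_pages() in query mode:

```c
/* Sketch: report which NUMA node each page of a mapping currently lives on.
 * move_pages() with a NULL 'nodes' array only queries locations, it does
 * not move anything. Compile with -lnuma. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <numaif.h>

void report_page_nodes(void *addr, size_t len)
{
    long page_size = sysconf(_SC_PAGESIZE);
    size_t n_pages = (len + page_size - 1) / page_size;
    void **pages = malloc(n_pages * sizeof(void *));
    int *status = malloc(n_pages * sizeof(int));

    for (size_t i = 0; i < n_pages; i++)
        pages[i] = (char *)addr + i * page_size;

    /* nodes == NULL means "just tell me where each page is"; a negative
     * status entry is a per-page error code. */
    if (move_pages(0 /* self */, n_pages, pages, NULL, status, 0) == 0) {
        for (size_t i = 0; i < n_pages; i++)
            printf("page %zu -> node %d\n", i, status[i]);
    }

    free(pages);
    free(status);
}
```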

So, if you want to run a numactl experiment, then I’d recommend doing this:

  • If there’s a primary index around that was created by running asd without numactl, get rid of it, e.g., by rebooting the machine to clear the shared memory segments. (Or destroy them manually using ipcs and ipcrm.)
  • Run asd under numactl as you proposed.

One general piece of advice when running multiple asd processes on a single machine: Use rack-awareness. Suppose you have a replication factor of 2 as well as 2 NUMA nodes per machine. Without rack-awareness, both copies of a record could end up on the same machine. Turn each machine into a rack, so that both asd processes on a machine “are in the same rack.” Rack-awareness now ensures that the two copies of a record are in different racks, i.e., on different machines.

This first incarnation of NUMA pinning focuses on super-high-TPS in-memory workloads whose performance is limited by RAM latency. In particular, it only reconfigures network interfaces; it doesn’t currently touch anything storage-related. With SSD-backed namespaces, it can currently happen that performance actually degrades, e.g., because irqbalance needs to be turned off. I think this might be what Albot observed.

All in all, yes, numactl is a partial replacement for the NUMA pinning feature. Be warned, however, that it may not do a lot for SSD-backed namespaces. But it might still be worth an experiment.
