Ooooh, interesting. Let me join your conversation, guys.
Maybe first a little bit of background on the kernel version requirement. For our pinning mechanism (CPU as well as NUMA) we need to be able to ask the kernel which CPU it uses to handle a given client TCP connection. We’ll then make sure that we process transactions from that connection on the same CPU. This keeps all processing local to one CPU, end to end. Unfortunately, only Linux versions 3.19 and later allow you to ask the kernel which CPU it uses to handle a TCP connection.
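For reference, the kernel feature in question is the SO_INCOMING_CPU socket option, which appeared in Linux 3.19. Here is a minimal sketch of how a server can query it for an accepted client socket - not Aerospike’s actual code, just the mechanism, and the function name is only for illustration:

```c
/* Minimal sketch: ask the kernel which CPU handles a given client TCP
 * socket. SO_INCOMING_CPU requires Linux 3.19 or later. */
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_CPU
#define SO_INCOMING_CPU 49  /* value on most architectures; newer headers define it */
#endif

int connection_cpu(int client_fd)
{
    int cpu = -1;
    socklen_t len = sizeof(cpu);

    /* The kernel reports the CPU on which it processed this socket's packets. */
    if (getsockopt(client_fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) < 0) {
        perror("getsockopt(SO_INCOMING_CPU)");
        return -1;
    }

    return cpu;  /* process this connection's transactions on this CPU */
}
```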
Now, let me talk a little bit about the functionality behind pinning. CPU pinning first:
- It uses all available CPUs across all NUMA nodes.
- It reconfigures the network interface(s) to evenly distribute network packet and TCP/IP processing across all CPUs. This exploits things like RSS and the CPU affinities of network queue interrupts. That’s why pinning is not compatible with `irqbalance`.
- It tells Aerospike to process transactions on the same CPU that did the network packet and TCP/IP processing (see the sketch below).
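To make the last point concrete: once we know the CPU from SO_INCOMING_CPU, the thread that processes the connection’s transactions is kept on that CPU. A hedged sketch of the underlying mechanism (again, not the actual asd code) using sched_setaffinity():

```c
/* Hedged sketch: pin the calling thread to the CPU that the kernel reported
 * for a client connection, so transaction processing stays on that CPU. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* A pid of 0 means the calling thread. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}
```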
Here’s how NUMA pinning differs from that:
- It tells Aerospike to only use the CPUs of one NUMA node. The NUMA node is selected with the `--instance` command line option. That’s why you have to run multiple `asd` processes - one per NUMA node.
- It migrates any shared memory (primary indexes) to the correct NUMA node, if necessary. Suppose that you ran without NUMA pinning first and then switched to NUMA pinning. Running without NUMA pinning would have allocated shared memory pretty evenly from all NUMA nodes, so you’d end up with primary index accesses that go to the wrong NUMA node(s) - unless you migrated the shared memory to the correct NUMA node. (A sketch of the underlying mechanism follows after this list.)
- Just like CPU pinning, it also reconfigures the network interface(s) to make sure that we stay on a single CPU end to end.
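As for the shared-memory migration mentioned above, the underlying kernel facility is the memory-policy API. A hedged sketch of moving an already-mapped region to one NUMA node with mbind() - address, length, and node number are placeholders, and this isn’t necessarily the exact call sequence asd uses:

```c
/* Hedged sketch: bind a mapped shared-memory region to one NUMA node and
 * migrate its already-allocated pages there. Compile with -lnuma for the
 * mbind() wrapper. Address, length, and node number are placeholders. */
#include <numaif.h>   /* mbind(), MPOL_BIND, MPOL_MF_MOVE */
#include <stddef.h>
#include <stdio.h>

int move_to_node(void *base, size_t len, int node)
{
    unsigned long nodemask = 1UL << node;

    /* MPOL_MF_MOVE asks the kernel to move pages that already live on
     * other nodes over to the node(s) in the mask. */
    if (mbind(base, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
              MPOL_MF_MOVE) != 0) {
        perror("mbind");
        return -1;
    }
    return 0;
}
```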
Now for `numactl`. Yes, you can mimic some of the above functionality with it:
- You can ensure that the `asd` threads all run on one NUMA node and that memory gets allocated from one NUMA node.
- You cannot ensure that transactions are processed on a single CPU end to end. We don’t know which CPU the kernel used to process network packets and to run the TCP/IP stack. That’s only possible with kernel 3.19 and later.
- You won’t get automatic migration of shared memory. And I’d have to check whether using `migratepages` on top of `numactl` would help, i.e., whether `migratepages` works with shared memory.
So, if you want to run a `numactl` experiment, then I’d recommend doing this:
- If you have a primary index around that was created by running `asd` without `numactl`, get rid of it, e.g., by rebooting the machine to get rid of the shared memory segments. (Or manually destroy them using `ipcs` and `ipcrm` - see the sketch below for a programmatic equivalent.)
- Run `asd` under `numactl` as you proposed.
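If you prefer to script the cleanup rather than run `ipcrm` by hand, the same thing can be done programmatically. A hedged sketch - the key is a placeholder, use whatever `ipcs -m` actually lists for the `asd` segments:

```c
/* Hedged sketch: programmatic equivalent of ipcrm for one System V
 * shared-memory segment. SHM_KEY is a placeholder, not the real asd key. */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SHM_KEY 0x12345678  /* placeholder - take the real keys from ipcs -m */

int main(void)
{
    int id = shmget(SHM_KEY, 0, 0);  /* look up the existing segment */
    if (id < 0) {
        perror("shmget");
        return 1;
    }

    /* Mark the segment for removal; it goes away once nobody is attached. */
    if (shmctl(id, IPC_RMID, NULL) < 0) {
        perror("shmctl(IPC_RMID)");
        return 1;
    }
    return 0;
}
```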
One general piece of advice when running multiple `asd` processes on a single machine: use rack-awareness. Suppose you have a replication factor of 2 as well as 2 NUMA nodes per machine. Without rack-awareness, both copies of a record could end up on the same machine. Turn each machine into a rack, so that both `asd` processes on a machine “are in the same rack.” Rack-awareness now ensures that the two copies of a record are in different racks, i.e., on different machines.
This first incarnation of NUMA pinning focuses on super-high-TPS in-memory workloads whose performance is limited by RAM latency. In particular, it only reconfigures network interfaces. It doesn’t currently touch anything storage-related. When you have SSD-backed namespaces, it can currently happen that performance actually degrades, e.g., because `irqbalance` needs to be turned off. I think that this might be what Albot observed.
All in all, yes, `numactl` is a partial replacement for the NUMA pinning feature. Be warned, however, that it may not do a lot for SSD-backed namespaces. But it might still be worth an experiment.