White Paper: Achieving data locality with the Aerospike auto-pin options
Data locality within Aerospike refers to the ability to run the entirety of a transaction for a given key on the same CPU. This is advantageous because it allows optimal use of the memory hierarchy, maximising use of CPU cache memory, which is faster to access than main memory. The Aerospike auto-pin functionality combines a number of strategies to maximise data locality and, in doing so, minimise latency. This white paper discusses the mechanisms involved and how auto-pin uses them to optimise data locality.
Network-related mechanisms
Receive Side Scaling (RSS)
Network interfaces have the concept of queues, which can be either transmit or receive queues. There can be multiple queues per interface. Each receive queue has a specific interrupt which, in turn, is linked to a specific CPU. This allows incoming network packets to be distributed over multiple CPUs; this is Receive Side Scaling (RSS). When the NIC places a packet on a receive queue, the interrupt signals to the CPU that there is data available to read. The queue is picked based upon a hash of various packet fields, so the distribution is random but also deterministic: all packets for a given connection will be sent to the same CPU.
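As a concrete illustration, the queue and interrupt layout described above can be inspected and adjusted with standard Linux tooling. The interface name eth0 and the IRQ numbers below are placeholders; actual values vary per system, and root privileges are required.

```shell
# Show how many receive/transmit queues (channels) the NIC exposes
ethtool -l eth0

# Use four combined queues, so interrupts spread over four CPUs
ethtool -L eth0 combined 4

# List the IRQ number assigned to each of the NIC's queues
grep eth0 /proc/interrupts

# Pin the IRQ of the first receive queue to CPU 0 (bitmask 0x1),
# the second to CPU 1 (bitmask 0x2), and so on
echo 1 > /proc/irq/41/smp_affinity
echo 2 > /proc/irq/42/smp_affinity
```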
Receive Flow Steering (RFS)
When an interrupt is triggered by a packet being assigned to a given network queue, the assigned CPU enters a service routine known as the top half. The time spent in the top half is minimised so that a maximum number of interrupts can be handled within a given time period; the top half simply forwards the information from the receive queue interrupt to a kernel thread known as the bottom half. The bottom half runs on the same CPU, and each CPU has a single bottom-half thread. The bottom half can handle the network packet on the same CPU, choose a CPU by hashing a number of packet characteristics, or send the packet to the CPU most recently used to process data from the connection concerned. The latter option is Receive Flow Steering (RFS). The implication of using RFS is that all packets for a given connection can be handled on the same CPU.
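RFS is enabled through two kernel knobs: a global table that records which CPU last processed each flow, and a per-queue flow count. The sketch below shows the standard procfs/sysfs interface; the interface name eth0 and the sizes chosen are illustrative only.

```shell
# Global table tracking the CPU that last processed each flow
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries

# Each receive queue's share of the flow table
# (typically rps_sock_flow_entries divided by the number of queues)
for q in /sys/class/net/eth0/queues/rx-*; do
    echo 2048 > "$q/rps_flow_cnt"
done
```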
Transmit Packet Steering (XPS)
Just as a network interface has receive queues, it also has transmit queues. When an application needs to write data to the network, it passes the data to the kernel, which stores it in a socket send buffer associated with the given connection. The kernel picks a NIC transmit queue to use, and this will not change over the life of the connection. The kernel TCP/IP stack then splits the data in the buffer into network packets and dispatches these to the chosen transmit queue. Transmit Packet Steering (XPS) provides the facility to limit each CPU to a given set of transmit queues.
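XPS is configured per transmit queue via a hexadecimal CPU mask in sysfs. In this illustrative sketch (interface name assumed), each of four transmit queues is limited to a single CPU:

```shell
# Each xps_cpus file takes a hex bitmask of allowed CPUs:
# tx-0 -> CPU 0, tx-1 -> CPU 1, tx-2 -> CPU 2, tx-3 -> CPU 3
echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus
echo 2 > /sys/class/net/eth0/queues/tx-1/xps_cpus
echo 4 > /sys/class/net/eth0/queues/tx-2/xps_cpus
echo 8 > /sys/class/net/eth0/queues/tx-3/xps_cpus
```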
Combining these mechanisms within Aerospike
The ideal situation in terms of data locality would be that all operations on a given key are performed on the same CPU. In practice this is not possible, as incoming packets can only be routed by connection, and it is quite possible that a given key would be operated on from multiple connections. The aim, therefore, is to keep a single transaction within the same CPU. This includes interrupt handling, TCP/IP processing and application processing. This process, in totality, is controlled by the Aerospike auto-pin functionality.
When a connection is opened and the first packet is received on the NIC receive queue, a randomised function decides whether the CPU receiving the interrupt continues to process the packet or passes it to another CPU within the system. When the Aerospike service-thread accepts the incoming client connection, it derives the CPU used by the kernel to process that connection; this is the CPU chosen by RSS. The service-thread then decides whether to stay on that CPU or redistribute the connection to the service-thread on an alternate CPU. Once the service-thread has chosen a CPU, RFS ensures that any subsequent packets received on the connection concerned are processed on that same CPU. XPS is used to ensure that any transmit interrupts are also handled on the same CPU.
In combination, these steps ensure that all interrupt handling, kernel and application processing are done on the same CPU, maximising data locality for a given transaction. This behaviour is activated when the auto-pin configuration parameter is set to cpu.
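In the Aerospike configuration file this takes the form of the fragment below; the surrounding service settings are omitted for brevity:

```
service {
    auto-pin cpu
    # service-threads is left at its default of one thread per CPU;
    # auto-pin will reset a custom value (see the caveats discussed later)
}
```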
In certain situations it is possible to take auto-pin a step further. Not all memory access incurs the same cost. Architectures where some main memory is closer to a CPU, and therefore less expensive to access, are known as Non-Uniform Memory Access (NUMA) architectures. Within a NUMA architecture, memory and CPUs are grouped into NUMA nodes; the cores within a given NUMA node have quicker access to the memory within that same node. Optimal data locality would dictate that a given key should always be processed within the same NUMA node. As described above, within a given Aerospike node it is only possible to route connections, not all operations for a particular key, to a given CPU. This is solved by running a separate Aerospike daemon per NUMA node. In this way, the daemon on a NUMA node will be the master for the partitions it owns and, therefore, will always handle operations for the keys within those partitions. Combined with the functionality previously described, this means that all operations for a given key will be performed within the same NUMA node and transactions will be handled end to end by the same CPU. This behaviour is enabled by setting auto-pin to numa.
Setting up two Aerospike nodes within the same physical server is a non-trivial exercise which requires specific configuration and has sizing implications. For these reasons it is recommended that customers intending to set auto-pin to numa do so with the assistance of the Aerospike Client Services team.
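The underlying idea of one daemon per NUMA node can be illustrated with numactl. The configuration file paths below are hypothetical, and in practice this setup should be undertaken with Aerospike Client Services rather than hand-rolled:

```shell
# Show the NUMA topology of the server (node count, CPUs, memory per node)
numactl --hardware

# Run one Aerospike daemon bound to the CPUs and memory of each NUMA node
numactl --cpunodebind=0 --membind=0 asd --config-file /etc/aerospike/aerospike-node0.conf
numactl --cpunodebind=1 --membind=1 asd --config-file /etc/aerospike/aerospike-node1.conf
```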
Due to the way in which it operates, there are certain caveats that must be considered when implementing auto-pin. Primarily, a minimum Linux kernel version of 3.19 is required. Furthermore, because auto-pin takes control of where Aerospike threads run, if service-threads has been increased past the default of one thread per CPU, this will be reset, which may impact the use case elsewhere. In this scenario it may be advantageous to discuss auto-pin with Aerospike Client Services to determine whether removing the custom configuration is the best choice for the use case. Because auto-pin assigns transactions to CPUs other than those associated with a given receive queue interrupt, there is an increase in interrupts; this is an expected trade-off for the gains auto-pin provides.
The auto-pin functionality offers optimal data locality for Aerospike nodes and allows Aerospike to take advantage of NUMA. That being said, it involves low-level system tuning and a number of specific requirements. For this reason, if the motivation for implementing it is a latency issue, a normal latency investigation should be completed before auto-pin is implemented.
A note on Receive Packet Steering (RPS)
Within Linux there is the concept of Receive Packet Steering (RPS). This is a software method by which the CPU which receives the interrupt distributes packets to other CPUs based on a hash of the packet metadata. This differs from auto-pin in that it only manages inbound network packets; auto-pin manages internal Aerospike threads and, therefore, acts for the duration of a given transaction rather than just managing the inbound data flow.
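For comparison, RPS is enabled per receive queue by writing a CPU bitmask to sysfs; the interface name and mask below are illustrative:

```shell
# Allow the kernel to spread packets from rx-0 across CPUs 0-3 (mask 0xf)
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
```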