Disabling transparent huge pages (THP) for Aerospike


Background

On Linux, an application’s virtual memory is not mapped to physical memory byte by byte. The mapping is done in units called ‘pages’. A standard page is 4KB in size, so when an application allocates memory, the kernel backs it with at least one 4KB page of physical memory, even if the application requested less.

The problem is that with 4KB pages, 16GB of RAM is divided into 4,194,304 pages. The kernel must therefore keep a map of over 4 million page allocations, and looking them up can be expensive in terms of CPU cycles.

In order to alleviate the problem, huge pages were introduced. By default (on x86-64), they come in 2 flavours: 2MB and 1GB. Unfortunately, that meant application developers would need to handle huge page allocations and deallocations themselves, keeping track of the memory. This wasn’t an easy task, and so, in Linux, Transparent Huge Pages (THP) were invented. With THP, the Linux kernel itself allocates 2MB pages for the application and also handles the deallocation and defragmentation of said pages. With a THP size of 2MB, 16GB of RAM requires just 8192 pages. That is a much smaller map, which is much faster to traverse. Some limited testing reveals that, on a machine performing huge allocations and deallocations, the CPU can waste up to 10% of its cycles on map lookups without THP, while with THP this becomes just 2% on the same test.
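The page counts quoted above can be reproduced with a few lines of shell arithmetic. This is only a sketch of the calculation; the helper name page_count is made up for illustration.

```shell
# page_count RAM_BYTES PAGE_BYTES: number of pages needed to cover RAM_BYTES.
# (page_count is a hypothetical helper, not part of any tool.)
page_count() {
    echo $(( $1 / $2 ))
}

ram=$(( 16 * 1024 * 1024 * 1024 ))         # 16GB of RAM, in bytes
page_count "$ram" $(( 4 * 1024 ))          # 4KB pages -> 4194304
page_count "$ram" $(( 2 * 1024 * 1024 ))   # 2MB pages -> 8192
```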

Problem description

Unfortunately, THP comes with some issues. The first concerns wasted memory and the defragmentation needed to reclaim it. If you malloc a 1.1MB chunk, you get a 2MB THP. If you then malloc another 1.1MB chunk, you end up with a second 2MB huge page. You are now holding 4MB in 2 huge pages while you have only requested 2.2MB, so more memory is used than was asked for.
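The arithmetic of that example can be sketched as follows, under the simplified model used above in which every allocation is rounded up to whole 2MB pages. The helper name thp_backed_kb is hypothetical, and 1126KB is used as an approximation of 1.1MB.

```shell
# thp_backed_kb ALLOC_KB: memory held, in KB, if ALLOC_KB is rounded up
# to a whole number of 2MB (2048KB) pages. Simplified model, for illustration only.
thp_backed_kb() {
    thp_kb=2048
    echo $(( ( ($1 + thp_kb - 1) / thp_kb ) * thp_kb ))
}

# Two allocations of roughly 1.1MB (1126KB) each:
requested=$(( 1126 * 2 ))
held=$(( $(thp_backed_kb 1126) * 2 ))
echo "requested=${requested}KB held=${held}KB"   # requested=2252KB held=4096KB
```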

The kernel has the ability to transform huge pages into normal pages and to defragment THP. Kernel defragmentation moves smaller allocations which fit into a single THP into that one THP, freeing other small pages and THPs. This comes at a cost in the form of latency spikes. While the kernel runs a defrag on, say, 3 THPs, those pages are locked (blocking any application access to them) until the defrag is done. An application therefore sees a miniature latency spike whenever it requests memory pages which the kernel is defragmenting.

Another issue is that alternative memory allocators, such as jemalloc, don’t play nicely with THP. jemalloc uses madvise extensively to notify the operating system that it’s done with a range of memory which it had previously malloc’ed. With transparent huge pages in use, the page size is 2MB, so much of the memory marked with madvise(…, MADV_DONTNEED) lies within ranges substantially smaller than a page. The operating system is then never able to evict pages containing ranges marked as MADV_DONTNEED, because the entire page would have to be unneeded to allow it to be reused. This may result in a huge amount of memory being lost that cannot be freed. As such, jemalloc statistics will show much lower memory usage than the operating system reports as resident memory for the process.

Check

To check the process’s RSS (resident) memory usage (add vsz to the column list to also see VSZ, the virtual memory usage):

$ ps -eo rss,comm |grep asd
 287039291392 asd

To check the asd process jemalloc statistics:

$ asinfo -v 'jem-stats:'

This will dump the jemalloc statistics either to the console (to view: journalctl -u aerospike.service) or to /var/log/syslog or /var/log/messages, depending on the system. The output will include a line which looks like this:

Allocated: 136475344544, active: 156852981760, metadata: 7333247744, resident: 174832066560, mapped: 287039291392, retained: 25113395200

The RSS reported by ps should be equal, or at least close, to the “mapped” value reported by jemalloc. Bear in mind that ps reports RSS in kilobytes, while jemalloc reports bytes. If the two values are really far apart, you have most likely fallen victim to THP (a memory leak cannot be excluded, but the THP issue is much more plausible at this stage).

You can also check this without running jem-stats, by finding the following log line (introduced in version 3.10.1) in your Aerospike logs (details in the log reference manual):
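As a rough sketch of that comparison, the fields of the jem-stats summary line can be pulled out with sed and compared against the operating system’s figure. The helper names jem_field and gap_pct are made up for illustration, and the live-node usage in the comments assumes pidof is available and that a single asd process is running.

```shell
# jem_field NAME LINE: print the number that follows "NAME:" in a jem-stats
# summary line. The line is lowercased first, so NAME must be given in lowercase.
jem_field() {
    printf '%s\n' "$2" | tr 'A-Z' 'a-z' | sed -n "s/.*$1: *\([0-9][0-9]*\).*/\1/p"
}

# gap_pct OS_BYTES JEM_BYTES: by how many percent the OS figure exceeds
# the jemalloc figure (integer arithmetic, rounded towards zero).
gap_pct() {
    echo $(( ($1 - $2) * 100 / $2 ))
}

line='Allocated: 136475344544, active: 156852981760, metadata: 7333247744, resident: 174832066560, mapped: 287039291392, retained: 25113395200'
jem_field mapped "$line"      # 287039291392
jem_field resident "$line"    # 174832066560

# Hypothetical usage on a live node (ps reports RSS in kilobytes, hence the * 1024):
#   rss_bytes=$(( $(ps -o rss= -p "$(pidof asd)") * 1024 ))
#   gap_pct "$rss_bytes" "$(jem_field mapped "$line")"
```

A large positive gap is what the THP problem described above would look like; a value near zero suggests jemalloc and the operating system agree.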

Jun 14 2018 06:18:54 GMT: INFO (info): (ticker.c:241)    system-memory: free-kbytes 259843576 free-pct 98 heap-kbytes (2518411,2584772,3786752) heap-efficiency-pct 66.5

The heap-efficiency-pct value is affected by fragmentation, normally of secondary indexes. The number we are interested in is heap-kbytes, particularly the third number in the triplet. This is heap_mapped_kbytes, and it should match the jemalloc “mapped” statistic (expressed in kilobytes rather than bytes).
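The third number of the triplet can be extracted from the ticker line with a short sed expression. This is a minimal sketch: the helper name heap_mapped_kb is hypothetical, and the log file path in the commented usage is an assumption which depends on your logging configuration.

```shell
# heap_mapped_kb: read ticker lines on stdin and print the third number
# of the heap-kbytes triplet for every line that contains one.
heap_mapped_kb() {
    sed -n 's/.*heap-kbytes ([0-9]*,[0-9]*,\([0-9]*\)).*/\1/p'
}

sample='Jun 14 2018 06:18:54 GMT: INFO (info): (ticker.c:241)    system-memory: free-kbytes 259843576 free-pct 98 heap-kbytes (2518411,2584772,3786752) heap-efficiency-pct 66.5'
printf '%s\n' "$sample" | heap_mapped_kb    # 3786752

# Hypothetical usage against a log file (path is an assumption):
#   grep 'system-memory' /var/log/aerospike/aerospike.log | tail -n 1 | heap_mapped_kb
```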

See Understanding linux memory usage reporting for more details on how to read memory allocation.

Huge page usage can be reviewed further in the detailed jem-stats dump and through the kernel’s nr_hugepages and meminfo files:

$ cat /proc/sys/vm/nr_hugepages
$ cat /proc/meminfo |grep -i huge
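Of the values printed by the meminfo command, AnonHugePages is the one that shows how much anonymous memory is currently backed by transparent huge pages across the whole system. A small sketch for reading just that figure follows; the helper name anon_hugepages_kb is made up for illustration.

```shell
# anon_hugepages_kb [FILE]: print the AnonHugePages value (in kB) from a
# meminfo-style file. FILE defaults to /proc/meminfo.
anon_hugepages_kb() {
    awk '/^AnonHugePages:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Only query the live system if /proc/meminfo is readable.
if [ -r /proc/meminfo ]; then
    echo "Memory backed by THP system-wide: $(anon_hugepages_kb) kB"
fi
```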

Solution

Since jemalloc already performs its own memory management, ensuring low memory fragmentation, delayed deallocations and other improvements, there is little gain (if any) from THP itself. With jemalloc, THP is actually more harmful than helpful, since the process ends up holding on to memory it thinks it has released. Therefore, for applications which cannot tolerate latency spikes and which use an alternative memory allocator, it is best to just turn THP off.

To disable THP at runtime:

RHEL systems whose kernel exposes the redhat_transparent_hugepage path (typically RHEL 6):

echo "never" > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo "never" > /sys/kernel/mm/redhat_transparent_hugepage/defrag

Other kernels:

echo "never" > /sys/kernel/mm/transparent_hugepage/enabled
echo "never" > /sys/kernel/mm/transparent_hugepage/defrag
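After writing to these files, it is worth reading them back: the kernel lists the available settings and marks the active one in square brackets. The sketch below extracts the bracketed word and checks both files. The helper names thp_mode and thp_is_disabled are hypothetical.

```shell
# thp_mode FILE: print the active setting of a THP control file,
# i.e. the word shown inside [brackets].
thp_mode() {
    sed -n 's/.*\[\([a-z+_]*\)].*/\1/p' "$1"
}

# thp_is_disabled DIR: succeed only if both enabled and defrag are set to "never".
thp_is_disabled() {
    [ "$(thp_mode "$1/enabled")" = "never" ] && [ "$(thp_mode "$1/defrag")" = "never" ]
}

# Report the state of whichever THP directory exists on this machine.
for dir in /sys/kernel/mm/transparent_hugepage /sys/kernel/mm/redhat_transparent_hugepage; do
    if [ -r "$dir/enabled" ]; then
        echo "$dir: enabled=$(thp_mode "$dir/enabled") defrag=$(thp_mode "$dir/defrag")"
    fi
done
```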

Unfortunately, this needs to be done BEFORE the aerospike process starts. If asd was already running, you will have to restart it, preferably with a cold start, to ensure all previously malloc’d space is freed; better still, reboot the whole machine. It is therefore best to disable THP at boot time, before aerospike starts, and then restart the OS.

In order to disable THP on SysVinit systems (non-systemd):

Create /etc/init.d/disable-transparent-hugepages with the following contents:

#!/bin/bash
### BEGIN INIT INFO
# Provides:          disable-transparent-hugepages
# Required-Start:    $local_fs
# Required-Stop:
# X-Start-Before:    aerospike
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Disable Linux transparent huge pages
# Description:       Disable Linux transparent huge pages, to improve
#                    database performance.
### END INIT INFO

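case $1 in
  start)
    # Locate the THP control directory (standard path first, then the RHEL-specific one).
    if [ -d /sys/kernel/mm/transparent_hugepage ]; then
      thp_path=/sys/kernel/mm/transparent_hugepage
    elif [ -d /sys/kernel/mm/redhat_transparent_hugepage ]; then
      thp_path=/sys/kernel/mm/redhat_transparent_hugepage
    else
      # No THP directory found; nothing to do. (exit, not return: this is not a function.)
      exit 0
    fi

    # Disable THP and its defragmentation.
    echo 'never' > "${thp_path}/enabled"
    echo 'never' > "${thp_path}/defrag"

    # khugepaged/defrag holds either a number (0/1) or a word (yes/no), depending on the kernel.
    re='^[0-1]+$'
    if [[ $(cat "${thp_path}/khugepaged/defrag") =~ $re ]]
    then
      echo 0  > "${thp_path}/khugepaged/defrag"
    else
      echo 'no' > "${thp_path}/khugepaged/defrag"
    fi

    unset re
    unset thp_path
    ;;
esac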

Make the file executable:

chmod +x /etc/init.d/disable-transparent-hugepages

Enable startup script:

# on debian/ubuntu
update-rc.d disable-transparent-hugepages defaults
# on RHEL/centos
chkconfig --add disable-transparent-hugepages

In order to disable THP on systemd systems:

First, create a file with the contents of the above startup script, but store it in /usr/local/bin/disable-transparent-huge-pages.sh

Make it executable:

chmod +x /usr/local/bin/disable-transparent-huge-pages.sh

Create /etc/systemd/system/disable-transparent-huge-pages.service with the following:

[Unit]
Description=Disable Transparent Huge Pages
Before=aerospike.service

[Service]
Type=oneshot
ExecStart=/bin/bash /usr/local/bin/disable-transparent-huge-pages.sh start

[Install]
WantedBy=multi-user.target

Enable the service:

systemctl daemon-reload
systemctl enable disable-transparent-huge-pages.service

After creating the startup scripts:

In all cases, once the startup scripts are in place, it’s best to restart the OS, freeing any and all THPs which applications may have been holding on to. THP is only disabled from the moment the script runs, so some drivers and applications which started earlier may still hold a few THPs, but this won’t be a problem.
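To confirm after the reboot that asd itself no longer holds any THPs, the AnonHugePages entries in the process’s smaps file can be summed. This is a sketch only: the helper name proc_thp_kb is made up, and the commented usage assumes pidof is available, a single asd process is running, and you have permission to read its smaps file.

```shell
# proc_thp_kb SMAPS_FILE: total AnonHugePages (in kB) across all mappings
# listed in a /proc/<pid>/smaps-style file. Prints 0 if there are none.
proc_thp_kb() {
    awk '/^AnonHugePages:/ { sum += $2 } END { print sum + 0 }' "$1"
}

# Hypothetical usage on a live node (run as root or as the asd user):
#   proc_thp_kb "/proc/$(pidof asd)/smaps"
```

A result of 0 means none of the process’s memory is backed by transparent huge pages.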

Keywords

DISABLE THP TRANSPARENT HUGE PAGES JEMALLOC RSS VSZ MEMORY MADVISE

Timestamp

6/6/2018