Understanding Linux memory usage reporting

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

The RAW data

JEMalloc statistics - obtained by querying Aerospike’s jemalloc allocation information:

Allocated: 136475344544, active: 156852981760, metadata: 7333247744, resident: 174832066560, mapped: 287039291392, retained: 25113395200

log statistics - printed in the aerospike.log file, reporting heap efficiency:

system-memory: free-kbytes 507892900 free-pct 63 heap-kbytes (134803970,154965292,281858048) heap-efficiency-pct 47.8

free - standard output from free -h:

      Total  Used  Free  Shared  Buffers  Cache
Mem: 757G 478G 278G 53G 226M 262G
-/+ buffers/cache: 216G 540G

ps aux - filtered output from ps aux, showing the asd process

USER       PID %CPU %MEM       VSZ       RSS TTY STAT START     TIME COMMAND
2660130 176307 47.6 37.6 345120972 281858048   ?  Ssl Jun14 49900:50 /usr/bin/asd --config-file
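As a sketch of how those fields line up, the VSZ and RSS columns (fields 5 and 6, in KiB) can be pulled out with awk. The script below runs against the sample line quoted above; the live-system one-liner in the comment assumes your process is named asd:

```shell
#!/bin/sh
# The ps aux line for asd quoted in this article; fields 5 and 6 are VSZ and RSS (KiB).
PS_LINE="2660130 176307 47.6 37.6 345120972 281858048 ? Ssl Jun14 49900:50 /usr/bin/asd --config-file"

VSZ_KB=$(printf '%s\n' "$PS_LINE" | awk '{ print $5 }')
RSS_KB=$(printf '%s\n' "$PS_LINE" | awk '{ print $6 }')
echo "VSZ: ${VSZ_KB} KiB, RSS: ${RSS_KB} KiB"

# On a live system (the [a] in the pattern stops grep from matching itself):
#   ps aux | grep '[a]sd' | awk '{ print $5, $6 }'
```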

Tying the numbers together

  • JEMAlloc → Allocated == log statistics → First number in the triplet
  • JEMAlloc → Active == log statistics → Second number in the triplet
  • JEMAlloc → Mapped == log statistics → Third number in the triplet
  • log statistics → heap_efficiency_pct == JEMAlloc → Allocated as percentage of JEMAlloc → Mapped
  • ps → RSS == JEMAlloc → Mapped
  • ps → VSZ == ps → RSS + any memory the process may use because it opened shared libraries and/or large files, even if the code didn’t malloc() this space. Not a very useful indicator in our case.
  • free → Used == used memory + shared + buffers + cache
    • the line underneath the first Mem line in the output shows memory used (including shared) without buffers and cache
    • that line also shows actual free memory, should cache be freed
    • as such, actual memory used by all processes (not shared) == Total - free - Cache - Buffers - Shared
  • ps → RSS == Used + Shared
    • real asd process memory use == ps → RSS - Shared
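The relationships above can be sanity-checked with a small shell sketch (not an official tool), using the heap-kbytes triplet from the sample log line in this article:

```shell
#!/bin/sh
# Sample values from the aerospike.log line quoted in this article.
ALLOCATED_KB=134803970   # first number in the triplet == JEMAlloc "allocated"
MAPPED_KB=281858048      # third number in the triplet == JEMAlloc "mapped"

# heap-efficiency-pct == allocated as a percentage of mapped.
EFFICIENCY=$(awk -v a="$ALLOCATED_KB" -v m="$MAPPED_KB" 'BEGIN { printf "%.1f", a * 100 / m }')
echo "heap-efficiency-pct: $EFFICIENCY"
```

This reproduces the 47.8 heap-efficiency-pct reported in the sample log line.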

But what do the numbers mean?

Name Meaning
JEMAlloc → Allocated The memory still allocated by the process from JEMAlloc. The process allocates this much from JEMAlloc, while JEMAlloc allocates the “mapped” amount from the OS. The difference between what the process allocates from JEMAlloc and what JEMAlloc allocates from the OS is fragmentation that most likely cannot be regained. In certain scenarios fragmentation can improve dramatically, with no speed penalty, by disabling THP (discussed later in this article).
JEMAlloc → Active The total number of active pages allocated, i.e. page size × the number of pages that JEMAlloc allocated. This will therefore be close to, but larger than, the “Allocated” figure, which counts allocated bytes rather than full pages.
JEMAlloc → Mapped See below for RSS explanation (since JEMAlloc → Mapped == RSS)
RSS Resident Set Size = how much memory is allocated to that process and is in RAM (including shared memory!). This includes all stack and heap memory and shared libraries, as long as they are in RAM. It does not include memory which is swapped out.
VSZ RSS + any memory the process may or may not touch (available to process should it need that memory). This includes shared libraries and currently open files (even if they have not been mapped to memory yet or never will). This also includes all memory that is swapped out that the process has allocated.
Shared Memory Memory that can be mapped to more than one process. It also stays resident in RAM when a process dies and is restarted, allowing for Aerospike’s Enterprise Edition Fast-Restart.
Buffers Operating system buffers, used to help the OS run smoothly. Should be quite small and can be ignored for the purpose of this article.
Cache The filesystem cache, made up primarily of clean and dirty cache. Dirty cache has not yet been flushed to disk (writes); clean cache allows for faster read access from the filesystem. Using file-backed storage in asd will result in a lot of dirty cache if you do a lot of writes. When using raw storage (or RAM), the cache is not used by asd, so it doesn’t make sense to keep it. Also, having less than 1GB of free memory due to cache use can result in a “catch-22” where a malloc cannot complete and the system cannot clear cache until the malloc completes. It therefore makes sense to tell the OS to always keep at least 1GB of RAM free (forcing clean cache to be released and dirty cache to be submitted for writing sooner). The reason for at least a single GB is that the asd process will make allocations to shared memory in 1GB chunks when needed. As such, you should have a minimum of 1GB plus a little bit for small mallocs in the asd process itself (think 1.5GB as a “safer” threshold). More can be found at: How to tune the Linux kernel for memory performance
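A minimal sketch of applying that guidance via vm.min_free_kbytes (the exact value is an assumption; size it for your system, and note that both commands need root):

```shell
## reserve ~1.5GiB of free RAM (value is in KiB); applies immediately
sysctl -w vm.min_free_kbytes=1572864

## persist the setting across reboots
echo 'vm.min_free_kbytes = 1572864' >> /etc/sysctl.conf
```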


  • Why is the Used part of the ‘free’ output showing more than the asd process is using?
    • Because the OS, supporting processes and libraries also take up RAM
  • Why don’t the numbers that are supposed to be equal tie in exactly?
    • It is close to impossible to grab statistics from all outputs at exactly the same time on a very busy system, so some numbers may drift a little
  • Why is heap efficiency so low?
    • Normally heap efficiency should remain quite high. If you use secondary indexes heavily and perform a lot of insert/delete/update, the heap efficiency will be reduced due to fragmentation. Further reduction is caused by THP (Transparent Huge Pages) which should be disabled and is explained further down this article.
  • Is it possible to see mapped memory at a different value to RSS?
    • Yes. While the RSS shows an accurate value for the amount of memory currently in use, when JEMAlloc maps a 2MB chunk of memory (512 × 4KB pages), this is not immediately reflected in the RSS, as JEMAlloc backs the 4KB pages lazily as they are accessed.
  • Is it possible to map more virtual memory than the system has available?
    • Yes, this is called overcommitting and is controlled by settings within /proc/sys/vm/overcommit_memory.
      • By default, it’s set to 0. This means that the kernel uses heuristics to determine how much more than the actual amount of physical RAM it will allow to be mapped.
      • 1 means that the kernel doesn’t enforce any limit on mapped memory. Processes can map as much memory as they wish, but if they collectively access more mapped pages than there are physical pages, a process that accesses a mapped page the kernel cannot lazily back with physical RAM will likely receive a SIGBUS.
      • 2 means that the kernel allows mapped memory as follows:
        • If /proc/sys/vm/overcommit_kbytes is set to a non-zero value, then the limit is the amount of swap space plus this setting interpreted as KiB.
        • Otherwise, it is the amount of swap space plus a fraction of the amount of physical RAM. The fraction is given as a percentage by /proc/sys/vm/overcommit_ratio. By default, this is set to 50, i.e., the limit by default is the amount of swap space plus 50% of the amount of RAM.
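The mode-2 limit can be illustrated with a small shell calculation (not a kernel interface; the swap and RAM figures below are assumed sample values):

```shell
#!/bin/sh
# Assumed sample values: 8GiB of swap, 16GiB of physical RAM.
SWAP_KB=8388608
RAM_KB=16777216
OVERCOMMIT_RATIO=50   # the default in /proc/sys/vm/overcommit_ratio

# overcommit_memory=2 limit == swap + (overcommit_ratio% of RAM)
LIMIT_KB=$(( SWAP_KB + RAM_KB * OVERCOMMIT_RATIO / 100 ))
echo "CommitLimit: ${LIMIT_KB} kB"
```

The current mode and ratio on a live system can be read with `cat /proc/sys/vm/overcommit_memory /proc/sys/vm/overcommit_ratio`.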


Transparent Huge Pages (THP) is a Linux memory management feature that reduces the overhead of Translation Lookaside Buffer (TLB) lookups on machines with large amounts of memory by using larger memory pages. The system will therefore hand JEMAlloc larger pages of memory than requested, to try to reduce the overhead of small allocations. Unfortunately, for database systems such as Aerospike, this causes allocation issues: memory that cannot be released. By default, a standard allocation, as JEMAlloc sees it, is 4KB. THP, however, will assign a 2MB page by default, resulting in a large chunk being occupied and potentially unused, which cannot be freed as long as any 4KB page within it is in use. Since JEMAlloc performs its own overhead reduction and fragmentation avoidance, THP is counter-productive and achieves exactly the opposite. As such, it is a good idea to disable it.

Full information about disabling THP can be found here: How to use, monitor, and disable transparent hugepages in Red Hat Enterprise Linux 6 and 7? - Red Hat Customer Portal

In short, to disable THP on the fly, run the following BEFORE starting Aerospike:

## redhat
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
## all other distros
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

To verify that the THP has been disabled, you can ‘cat’ the above files. The output will look similar to this:

# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

This shows that the three acceptable values are: always, madvise and never. The value in square brackets is the currently active one. Therefore, THP is disabled here (set to never).
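Extracting the bracketed value can be scripted. A minimal sketch, run here against the sample output above (on a live system, point sed at the transparent_hugepage/enabled file instead):

```shell
#!/bin/sh
# Sample output of /sys/kernel/mm/transparent_hugepage/enabled, from above.
SAMPLE="always madvise [never]"

# Keep only the word inside the square brackets.
ACTIVE=$(printf '%s\n' "$SAMPLE" | sed 's/.*\[\(.*\)\].*/\1/')
echo "THP is currently: $ACTIVE"
```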

To perform the change permanently, you can edit the grub configuration and add the following to the end of the ‘kernel …’ line: transparent_hugepage=never

An example grub conf could look like this (note, only the part mentioned above has been added, no other modification has been made):

menuentry 'Linux Mint 18.1 MATE 64-bit' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-d0bcecde-6975-4a78-a9d5-a6051d9e1223' {
	gfxmode $linux_gfx_mode
	insmod gzio
	if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
	insmod part_msdos
	insmod btrfs
	set root='hd0,msdos1'
	if [ x$feature_platform_search_hint = xy ]; then
	  search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1  d0bcecde-6975-4a78-a9d5-a6051d9e1223
	else
	  search --no-floppy --fs-uuid --set=root d0bcecde-6975-4a78-a9d5-a6051d9e1223
	fi
	linux	/@/boot/vmlinuz-4.4.0-79-generic root=UUID=d0bcecde-6975-4a78-a9d5-a6051d9e1223 ro rootflags=subvol=@  quiet splash $vt_handoff transparent_hugepage=never
	initrd	/@/boot/initrd.img-4.4.0-79-generic
}

Primary Index Reporting

From Aerospike 4.8.0 a new command, index-pressure, gives more detailed information about memory used by the primary index. Its format is shown below:

$ asinfo -l -v index-pressure

The command returns two numbers, indicating the amount, in bytes, taken by the primary index and the amount of that which is dirty (not flushed to disk). When Aerospike is running in hybrid storage mode, the two numbers are equal: all pages are considered dirty, as they only ever exist in memory and are never flushed to disk.

The command is particularly useful when Aerospike is running with the index on disk. In all-flash mode, the index-pressure command indicates how far behind the index write-back is lagging.

In all flash, primary indexes are mmap()ed. They are modified as if they were in RAM. When an index entry is touched, the kernel brings the corresponding page from the index drive to RAM. If an index entry is modified, the kernel lazily writes the corresponding modified page from RAM back to the index drive. The RAM the page used then becomes available again for other purposes.

Pages that have been modified but not yet written back to the index drive are dirty pages. When the write-back process cannot keep up with index modifications, dirty pages pile up and consume more and more RAM, to the point where the system may run out of memory.
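On Linux, the system-wide dirty and write-back volumes (kernel-wide figures, not the per-namespace view that index-pressure provides) can be watched via /proc/meminfo:

```shell
# Show system-wide dirty pages and pages currently being written back (KiB).
grep -E '^(Dirty|Writeback):' /proc/meminfo
```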

The index-pressure command gives the amount of RAM taken up by a namespace’s primary index pages currently cached in RAM, as well as how many of them are dirty. The higher the dirty value, the further the write-back is lagging.

An example of the index-pressure command running with the index on disk is shown below. Here, of the ~1.5GiB total index, around 68KiB is dirty.

$ asinfo -l -v index-pressure

The less-relevant (to asd) part about shared memory calculation and RSS

If process 1 uses 100GB of memory and process 2 uses 100GB of memory, then total use is 200GB, 100GB per process. Each process will also show 100GB of RSS.

Now let’s add shared buffers to the calculations. Let’s say there is 50GB of shared buffers.

Now say that process 1 touched all 50GB and process 2 touched only 20GB.

Process 1 will have touched 30GB that process 2 didn’t. That leaves us with 20GB that both processes touched.

Linux will calculate that as 20GB/2 = 10GB per process.

Process 1’s RSS will show 100GB + the 30GB of shared memory it touched on its own + its equal share of the 20GB it touched with the other process = 100+30+10 = 140GB.

Process 2’s RSS is 100GB plus its equal share of what it touched with the other process (10GB) = 110GB.
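The arithmetic above can be sketched in shell:

```shell
#!/bin/sh
# Private (unshared) memory per process, in GB.
P1_PRIVATE=100
P2_PRIVATE=100
P1_ONLY_SHARED=30   # shared pages only process 1 touched
BOTH_SHARED=20      # shared pages both processes touched

# Pages touched by both processes are split equally between them.
P1_RSS=$(( P1_PRIVATE + P1_ONLY_SHARED + BOTH_SHARED / 2 ))
P2_RSS=$(( P2_PRIVATE + BOTH_SHARED / 2 ))
echo "process 1: ${P1_RSS}GB, process 2: ${P2_RSS}GB"
```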




February 2020