Understanding Linux memory usage reporting



The RAW data

JEMAlloc statistics - obtained by querying Aerospike’s jemalloc allocation information (values are in bytes):

Allocated: 136475344544, active: 156852981760, metadata: 7333247744, resident: 174832066560, mapped: 287039291392, retained: 25113395200

log statistics - printed in the aerospike.log file, reporting heap usage and heap efficiency:

system-memory: free-kbytes 507892900 free-pct 63 heap-kbytes (134803970,154965292,281858048) heap-efficiency-pct 47.8

free - standard output from free -h:

             Total       Used       Free     Shared    Buffers      Cache
Mem:          757G       478G       278G        53G       226M       262G
-/+ buffers/cache:       216G       540G

ps aux - filtered output from ps aux, showing the asd process:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
2660130 176307 47.6 37.6 345120972 281858048 ? Ssl Jun14 49900:50 /usr/bin/asd --config-file /etc/aerospike/aerospike.conf

Tying the numbers together

  • JEMAlloc -> Allocated == log statistics -> first number in the heap-kbytes triplet (JEMAlloc reports bytes, the log reports KB)
  • JEMAlloc -> Active == log statistics -> second number in the triplet
  • JEMAlloc -> Mapped == log statistics -> third number in the triplet
  • log statistics -> heap-efficiency-pct == JEMAlloc -> Allocated as a percentage of JEMAlloc -> Mapped
  • ps -> RSS == JEMAlloc -> Mapped
  • ps -> VSZ == ps -> RSS + any memory that the process may use because it opened shared libraries and/or large files, even if the code didn’t malloc() this space. Not a very useful indicator in this case.
  • free -> Used == used memory + shared + buffers + cache
    • the line underneath the Mem line in the output shows memory used (including shared) without buffers and cache
    • that line also shows the memory that would actually be free if the cache were released
    • as such, actual memory used by all processes (not shared) == Total - Free - Cache - Buffers - Shared
  • ps -> RSS == memory used by the process itself + Shared
    • real asd process memory use == ps -> RSS - Shared
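
As a rough check, the relationships above can be reproduced from the raw data quoted in this article. The sketch below is illustrative only: the helper names are made up for this example, and it assumes that the heap-kbytes triplet and the ps RSS column are in KB while the JEMAlloc figures are in bytes.

```shell
# Illustrative helpers (hypothetical names, not part of any Aerospike tool).

# heap_efficiency ALLOCATED MAPPED -> ALLOCATED as a percentage of MAPPED
heap_efficiency() {
    awk -v a="$1" -v m="$2" 'BEGIN { printf "%.1f\n", a * 100 / m }'
}

# bytes_to_kb BYTES -> whole KB, for comparing JEMAlloc bytes with log KB
bytes_to_kb() {
    echo $(( $1 / 1024 ))
}

# kb_to_gb KB -> GB with one decimal, for comparing with free -h
kb_to_gb() {
    awk -v k="$1" 'BEGIN { printf "%.1f\n", k / 1024 / 1024 }'
}

# First and third values of the heap-kbytes triplet from the log line.
heap_efficiency 134803970 281858048      # prints 47.8, the logged heap-efficiency-pct

# JEMAlloc mapped (bytes) expressed in KB, comparable to the third triplet value.
bytes_to_kb 287039291392                 # prints 280311808

# ps RSS (KB) minus the 53G of shared memory reported by free.
shared_kb=$(( 53 * 1024 * 1024 ))
kb_to_gb $(( 281858048 - shared_kb ))    # prints 215.8
```

The JEMAlloc and log figures differ slightly because the two samples were not taken at exactly the same moment.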

But what do the numbers mean?

  • JEMAlloc -> Allocated: Memory that the process currently has allocated from JEMAlloc. JEMAlloc in turn allocates the “mapped” amount from the OS. The difference between what the process allocates from JEMAlloc and what JEMAlloc allocates from the OS is fragmentation that most likely cannot be regained. In certain scenarios, disabling THP (discussed later in this article) reduces this fragmentation considerably, with no speed penalty.
  • JEMAlloc -> Active: The total size of the active pages, that is, page size * number of pages that JEMAlloc has allocated. It is therefore close to, but larger than, the “Allocated” figure, which counts allocated memory rather than full pages.
  • JEMAlloc -> Mapped: See the RSS explanation below (since JEMAlloc -> Mapped == RSS).
  • RSS: Resident Set Size = how much memory is allocated to the process and is in RAM (including shared memory!). This includes all stack and heap memory and shared libraries, as long as they are in RAM. It does not include memory that is swapped out.
  • VSZ: RSS + any memory the process may or may not touch (available to the process should it need that memory). This includes shared libraries and currently open files (even if they have not been mapped to memory yet or never will be). It also includes all memory the process has allocated that is swapped out.
  • Shared Memory: Memory that can be mapped into more than one process. It also stays resident in RAM when a process dies and is restarted, allowing for Aerospike’s Enterprise Edition Fast-Restart.
  • Buffers: Operating system buffers, used to help the OS run smoothly. They should be quite small and can be ignored for the purpose of this article.
  • Cache: The filesystem cache, made up primarily of clean and dirty cache. Dirty cache holds writes that have not yet been flushed to disk. Clean cache allows faster read access from the filesystem.
    • Using file-backed storage in asd results in a lot of dirty cache if you do a lot of writes. When using raw storage (or RAM), asd does not use the cache, so it doesn’t make sense to keep it.
    • Having less than 1GB of free memory due to cache use can result in a “catch-22” where a malloc cannot complete and the system cannot clear cache until that malloc has completed. It therefore makes sense to tell the OS to always keep more than 1GB of RAM free (forcing clean cache to be released and dirty cache to be submitted for writing sooner).
    • The reason for at least 1GB is that the asd process allocates shared memory in 1GB chunks when needed. As such, you should keep a minimum of 1GB free, plus a little for small mallocs in the asd process itself (think of 1.5GB as a “safer” threshold). More can be found at: Tuning Kernel Memory for Performance
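
The free-memory floor described for the cache can be checked with a short script. This is a minimal sketch under one assumption that the article does not state explicitly: that the setting in question is the kernel's vm.min_free_kbytes, exposed at /proc/sys/vm/min_free_kbytes. The function name is hypothetical, and the 1.5GB threshold is simply the “safer” figure mentioned above.

```shell
# Assumption: "keep over 1GB free" is controlled by vm.min_free_kbytes.

# min_free_ok FILE THRESHOLD_KB -> exit status 0 if the value in FILE is >= THRESHOLD_KB
min_free_ok() {
    local current
    current=$(cat "$1") || return 2
    [ "$current" -ge "$2" ]
}

threshold_kb=$(( 1536 * 1024 ))            # 1.5GB expressed in KB
sysctl_file=/proc/sys/vm/min_free_kbytes   # assumed location of the setting

if [ -r "$sysctl_file" ]; then
    if min_free_ok "$sysctl_file" "$threshold_kb"; then
        echo "min_free_kbytes is at least ${threshold_kb} KB"
    else
        echo "min_free_kbytes is below ${threshold_kb} KB"
    fi
fi
```

The file path is passed as an argument so the check can be pointed at a saved copy of the value when inspecting another host.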

Questions

  • Why is the Used part of the ‘free’ output showing more than the asd process is using?
    • Because the OS, supporting processes and libraries also take up RAM.
  • Why don’t the numbers that are supposed to be equal tie in exactly?
    • It is close to impossible to grab a statistic from all outputs at exactly the same time on a very busy system. Hence some numbers may drift a little.
  • Why is heap efficiency so low?
    • Normally heap efficiency should remain quite high. If you use secondary indexes heavily and perform a lot of insert/delete/update operations, heap efficiency is reduced by fragmentation. THP (Transparent Huge Pages) reduces it further; it should be disabled, as explained further down in this article.

THP

Transparent Huge Pages (THP) is a Linux memory management feature that reduces the overhead of Translation Lookaside Buffer (TLB) lookups on machines with large amounts of memory by using larger memory pages. With THP enabled, the system hands JEMAlloc larger pages of memory than it requested, to try to reduce the overhead of small allocations. Unfortunately, for database systems such as Aerospike, this leads to memory that cannot be released. JEMAlloc believes it has received a standard page, which is 4KB by default. THP, however, backs it with a 2MB page by default. The rest of that 2MB page is occupied and potentially unused, and cannot be freed while the 4KB page is in use. As JEMAlloc performs its own overhead reduction and fragmentation avoidance, THP is counter-productive and achieves exactly the opposite. As such, it is a good idea to disable it.
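
To put numbers on the paragraph above, the following sketch works through the default page sizes it mentions. The function names are hypothetical, and the "pinned" figure is a worst case (a single 4KB page in use inside an otherwise unused 2MB page), not a measurement.

```shell
# Page sizes in KB, as given in the text.
page_kb=4
huge_page_kb=2048

# Number of standard pages that fit in one huge page.
pages_per_huge_page() {
    echo $(( $2 / $1 ))
}

# Worst case: KB held back in a huge page while only one standard page is in use.
pinned_kb() {
    echo $(( $2 - $1 ))
}

pages_per_huge_page "$page_kb" "$huge_page_kb"   # prints 512
pinned_kb "$page_kb" "$huge_page_kb"             # prints 2044
```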

Full information about disabling THP can be found here: https://access.redhat.com/solutions/46111

In short, to disable THP on the fly, run this as root BEFORE starting Aerospike:

## redhat
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
## all other distros
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

To verify that THP has been disabled, you can ‘cat’ the above files. The output will look similar to this:

# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

This shows that the 3 acceptable values are: always, madvise and never. The value in square brackets is the currently active value. Therefore, in this output, THP is disabled (set to never).
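
The same check can be scripted by extracting the bracketed value. This is a small sketch; thp_active is a hypothetical helper name, and it only assumes the file format shown above (one line, with the active value in square brackets).

```shell
# thp_active FILE -> prints the value enclosed in square brackets
thp_active() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

thp_file=/sys/kernel/mm/transparent_hugepage/enabled

# Only report when the file exists on this system.
if [ -r "$thp_file" ]; then
    echo "THP enabled setting: $(thp_active "$thp_file")"
fi
```

Passing the path as an argument lets the same function be used for the defrag file or for the Red Hat specific paths.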

To make the change permanent, you can edit the grub configuration and add the following to the end of the ‘kernel …’ line (the ‘linux …’ line in GRUB 2): transparent_hugepage=never

An example grub conf could look like this (note, only the part mentioned above has been added, no other modification has been made):

menuentry 'Linux Mint 18.1 MATE 64-bit' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-d0bcecde-6975-4a78-a9d5-a6051d9e1223' {
	recordfail
	load_video
	gfxmode $linux_gfx_mode
	insmod gzio
	if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
	insmod part_msdos
	insmod btrfs
	set root='hd0,msdos1'
	if [ x$feature_platform_search_hint = xy ]; then
	  search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1  d0bcecde-6975-4a78-a9d5-a6051d9e1223
	else
	  search --no-floppy --fs-uuid --set=root d0bcecde-6975-4a78-a9d5-a6051d9e1223
	fi
	linux	/@/boot/vmlinuz-4.4.0-79-generic root=UUID=d0bcecde-6975-4a78-a9d5-a6051d9e1223 ro rootflags=subvol=@  quiet splash $vt_handoff transparent_hugepage=never
	initrd	/@/boot/initrd.img-4.4.0-79-generic
}
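
After a reboot, one way to confirm that the parameter was picked up is to look for it on the running kernel's command line. The sketch below assumes the standard /proc/cmdline file; the function name is hypothetical.

```shell
# cmdline_has_thp_never FILE -> exit status 0 if FILE contains transparent_hugepage=never
cmdline_has_thp_never() {
    # Put each kernel parameter on its own line, then look for an exact match.
    tr ' ' '\n' < "$1" | grep -qx 'transparent_hugepage=never'
}

if [ -r /proc/cmdline ]; then
    if cmdline_has_thp_never /proc/cmdline; then
        echo "kernel was booted with transparent_hugepage=never"
    else
        echo "transparent_hugepage=never is not on the kernel command line"
    fi
fi
```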

The less-relevant (to asd) part about shared memory calculation and RSS

If process 1 uses 100GB of memory, process 2 uses 100GB of memory, then total use is 200GB, 100GB per process. Those processes will also show 100GB RSS used each.

Now let’s add shared buffers to the calculations. Let’s say there is 50GB of shared buffers.

Now say that process 1 touched all 50GB and process 2 touched only 20GB.

Process 1 will have touched 30GB that process 2 didn’t. That leaves us with 20GB that both processes touched.

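Linux’s proportional accounting (reported as Pss in /proc/<pid>/smaps) splits that shared 20GB evenly: 20GB/2 = 10GB per process.

Process 1 will show 100GB + 30GB of shared memory it touched on its own + its equal share of the 20GB it touched along with the other process = 100+30+10 = 140GB.

Process 2 will show 100GB plus its equal share of what it touched along with the other process (10GB) = 110GB.

Note that the RSS column in ps counts every shared page a process has touched in full, so there the two processes would show 150GB and 120GB respectively.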
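
The arithmetic of this example can be written out as a short sketch. The function name and argument order are invented for illustration; all values are in GB and come from the example above.

```shell
# proportional_share PRIVATE OWN_SHARED COMMON_SHARED SHARERS
#   PRIVATE       memory used only by this process
#   OWN_SHARED    shared memory touched by this process alone
#   COMMON_SHARED shared memory touched by several processes
#   SHARERS       number of processes that touched COMMON_SHARED
proportional_share() {
    echo $(( $1 + $2 + $3 / $4 ))
}

touched_p1=50            # shared GB touched by process 1
touched_p2=20            # shared GB touched by process 2, all of it also touched by process 1
common=$touched_p2       # the part both processes touched

proportional_share 100 $(( touched_p1 - common )) "$common" 2   # process 1: prints 140
proportional_share 100 0 "$common" 2                            # process 2: prints 110
```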

Keywords

LINUX RAM MEMORY VSZ RSS JEMalloc shared THP Hugepages

Timestamp

10/24/17

