Free memory doesn't change after deleting record

jemstats

#1

Hi,

I have a 6-node cluster. Each node is a bare-metal machine with 24 cores, 256 GB of RAM, and a 10 Gbps network, running CentOS 6.7. I am using RAM-based storage with HDD. I have some secondary indexes as well. I have noticed that “Free memory” (free -h) doesn’t go up even after deleting records. I have deleted 1/3rd of my records. The total cluster size (record count and RAM usage) was also reduced to 1/3rd, but free memory on each node has not increased. The ASD process is still consuming the same amount of RAM, even though AMC shows lower RAM usage per node. When I restarted the ASD process on a node, free memory went to the correct value on that node. Do I need to restart each node to reclaim the free RAM?


#2

The 64 bytes used per record for the primary index are never freed back (the space will be reused by Aerospike). Sindex memory is garbage collected, so give that some time. If you have data-in-memory true, the data memory should be released, though it is up to the allocator (jemalloc) as to when it is released back to the OS.


#3

I have deleted 500M records, so 64 * 500x10^6 bytes ~ 30 GB, which is significant.
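The estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes exactly 64 bytes of primary index per record and 500 million deleted records, as stated in this thread, and ignores everything else (replication, secondary indexes, data).

```python
# Rough primary index memory held by the deleted records.
PRIMARY_INDEX_BYTES_PER_RECORD = 64   # per-record figure quoted in this thread
deleted_records = 500_000_000         # 500M deleted records, as stated above

index_bytes = PRIMARY_INDEX_BYTES_PER_RECORD * deleted_records
index_gb = index_bytes / 10**9    # decimal gigabytes
index_gib = index_bytes / 2**30   # binary gibibytes

print(f"{index_bytes} bytes = {index_gb:.1f} GB = {index_gib:.1f} GiB")
# prints: 32000000000 bytes = 32.0 GB = 29.8 GiB
```

So "~30 GB" matches the figure in binary units (GiB); in decimal units it is 32 GB.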

The 64 bytes used for the primary index is never freed back (the space will be reused by Aerospike)

Will it be reused for the index or for something else?

  • though it is up to the allocator (jemalloc) as to when it is to be released back to the OS

Can I force it to run now?


#4

It will be reused for the primary index.


#5

Even if it doesn’t release primary index memory, it should still release the other occupied memory. I didn’t see any change in RAM usage by the Aerospike process on the machines.


#6

Could you share your namespace configuration? You could also look into jemstats.

(It appears the KB article for jemstats isn’t finished; the following is an excerpt of the draft.)

[AER-5683] - (KVS) New memory subsystem with configurable allocation debugging.

This is what memory debugging offers:

  • Memory accounting: We keep track of which code location has allocated how much memory. This helps in the case of memory leaks.
  • Double-free detection and corruption detection: This helps us find issues in Aerospike code where allocations are potentially managed improperly, or where some parts of the code make different assumptions about allocations than other parts, causing unexpected bugs.

User Interface

There are two ways to interact with the memory subsystem as an Aerospike user.

1. debug-allocations configuration:

By default, this is disabled. Enabling it makes the server assert on double frees and detected corruption, which could otherwise go unnoticed. There is a potential risk when enabling this.

The configuration goes in the service stanza of the Aerospike configuration file and is static (changing it requires a service restart).

It takes one of the following four values as a parameter:

  • none - This completely disables any instrumentation of the allocation API: memory accounting, buffer overflow detection, double free detection. Instead, we simply forward any API calls directly to JEMalloc. In particular, this removes our 4-byte memory overhead per allocation.
  • all - This enables instrumentation for all allocations and, thus, incurs the 4-byte memory overhead for all allocations.
  • transient - This enables instrumentation only for transient - i.e., short-lived - allocations. Technically speaking, this exempts allocations from namespace arenas, whereas allocations from a thread’s default arena are covered by the instrumentation.
  • persistent - This is the complement to transient. This setting enables instrumentation for allocations from namespace arenas, whereas allocations from a thread’s default arena are exempt.

2. jem-stats asinfo command:

This is used to dump JEMalloc’s internal statistics as well as our own memory accounting information, i.e., our site_info records. Analyzing the output does require scripting using Python.

Syntax:

asinfo -v 'jem-stats:file=[...];options=[...];sites=[...]'

The info command takes three options. The first two concern JEMalloc’s internal statistics; the third concerns our own site_info records:

  • The file option specifies the file to receive JEMalloc’s internal statistics. The internal statistics are dumped via JEMalloc’s jem_malloc_stats_print(). If no file option is specified, the internal statistics are dumped to the Aerospike log file.

  • The options option is simply forwarded to the options parameter of jem_malloc_stats_print().

  • If specified, the sites option indicates that we would also like to dump the site_info records for all threads and all allocation sites. The option specifies the file to which to dump this information. The resulting file contains lines like the following, each of which describes one site_info record.

0x00000000005126c4      8762 0x0000000000000000 0x0000000000000240
0x00000000005126c4      8763 0xffffffffffffffff 0xfffffffffffffdc0

The four columns have the following meaning:

  • The code address of the allocation site of the site_info record.
  • The thread ID of the thread owning the site_info record.
  • The upper 64 bits of the 128-bit size in the site_info record (size_hi member).
  • The lower 64 bits of the size (size_lo member).

In the above example, we’re looking at the following allocation site:

$ addr2line -e /usr/bin/asd 0x5126c4
/home/xyz/as/aerospike-server/as/src/base/system_metadata.c:729

It looks like one thread, thread 8762, allocated 0x240 = 576 bytes at system_metadata.c:729. And another thread, thread 8763, seems to have deallocated these 576 bytes, resulting in a size member of -576 = 0xffff…fdc0.
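The interpretation above can be sketched in Python. This is a minimal illustration, not the Aerospike tooling: the function names are hypothetical, and it assumes the column layout shown in the sample lines (hex address, decimal thread ID, hex size_hi, hex size_lo), with the 128-bit size read as a two's complement signed value.

```python
from collections import defaultdict


def parse_site_line(line):
    """Parse 'address thread_id size_hi size_lo' into (address, thread_id, signed_size)."""
    addr_s, tid_s, hi_s, lo_s = line.split()
    # Combine the two 64-bit halves into one 128-bit value.
    size = (int(hi_s, 16) << 64) | int(lo_s, 16)
    # Interpret the 128-bit value as two's complement (top bit set means negative).
    if size >> 127:
        size -= 1 << 128
    return int(addr_s, 16), int(tid_s), size


def net_bytes_per_site(lines):
    """Sum the signed sizes over all threads for each allocation site."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        addr, _tid, size = parse_site_line(line)
        totals[addr] += size
    return dict(totals)


# The two sample lines from the excerpt above.
sample = [
    "0x00000000005126c4      8762 0x0000000000000000 0x0000000000000240",
    "0x00000000005126c4      8763 0xffffffffffffffff 0xfffffffffffffdc0",
]

for addr, net in net_bytes_per_site(sample).items():
    print(f"{addr:#x}: net {net} bytes")
# prints: 0x5126c4: net 0 bytes
```

Summing per site rather than per thread is a deliberate choice here: memory allocated by one thread and freed by another cancels out, so a site with a large positive net total is the more interesting signal when looking for leaks.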

Note: Further processing of the dumped raw site_info data would typically be done with, say, a Python script. At this stage, the Aerospike daemon itself doesn’t contain any facilities for further analysis.

Sample script: An Aerospike-developed Python script that does this is packaged with the Aerospike Tools package.

Usage: To translate “input-file” (which was obtained via “jem-stats:” and which contains memory addresses) into “output-file” (which contains human-readable file names and line numbers):

asparsemem -a /usr/bin/asd -i input-file -o output-file

#7
service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        #service-threads 24
        #transaction-queues 24
        transaction-threads-per-queue 4
        proto-fd-max 30000
        transaction-pending-limit 0
        auto-pin cpu
}

logging {
        # Log file must be an absolute path.
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address int4
                port 3000
                access-address int4
        }

        heartbeat {
                # mode multicast
                # address 224.0.0.116 # 239.1.99.222
                # port 9918

                mode mesh
                port 3002

                mesh-seed-address-port ....
                mesh-seed-address-port ....
                ...
                ...
                ...
                ...

                interval 150
                timeout 10
        }

        fabric {
                address int4
                port 3001
        }

        info {
                port 3003
        }
}


namespace test {
        replication-factor 2
        memory-size 1G
        default-ttl 30d # 30 days, use 0 to never expire/evict.

        storage-engine memory
}

#production namespace
namespace Production {
        replication-factor 2
        memory-size 242G
        default-ttl 0 # 30 days, use 0 to never expire/evict.
        high-water-disk-pct 50 # How full may the disk become before the
                               # server begins eviction (expiring records
                               # early)
        high-water-memory-pct 85 # How full may the memory become before the
                                 # server begins eviction (expiring records
                                 # early)
        stop-writes-pct 90  # How full may the memory become before
                            # we disallow new writes
        partition-tree-sprigs 4096
        partition-tree-locks 256
        # storage-engine memory
        storage-engine device {
                #device /dev/sdb1
                #data-in-memory false

                file /opt/aerospike/data/prod.data
                filesize 1000G # 8 times of RAM
                data-in-memory true

                #write-block-size 128K   # adjust block size to make it efficient for SSDs.
                # largest size of any object
        }
}

#8

Did the server’s resident memory decrease? The memory may be occupied by the filesystem cache. Could you provide the output of free -h and pmap $(pgrep asd)?
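One way to track the resident memory of the asd process over time is to read the VmRSS field from /proc/<pid>/status on Linux. The following is a small sketch under that assumption; the helper names are hypothetical and the sample status text is made up for illustration, not taken from this cluster.

```python
def rss_kib_from_status(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line looks like: "VmRSS:    123456 kB"
            return int(line.split()[1])
    return None


def read_rss_kib(pid):
    """Read the resident set size (KiB) of a running process on Linux."""
    with open(f"/proc/{pid}/status") as f:
        return rss_kib_from_status(f.read())


# Made-up sample text, only to show the parsing.
sample_status = "Name:\tasd\nVmPeak:\t  200000 kB\nVmRSS:\t  123456 kB\n"
print(rss_kib_from_status(sample_status))
# prints: 123456
```

Calling read_rss_kib with the asd PID before and after a delete would show whether the process itself has returned memory, independently of what the filesystem cache is doing.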


#9
~]$ free -h
             total       used       free     shared    buffers     cached
Mem:          252G       249G       2.4G       504K        84M        17G
-/+ buffers/cache:       232G        20G
Swap:           0B         0B         0B

~]$  pmap $(pgrep asd)
16912:   /usr/bin/asd --config-file /etc/aerospike/aerospike.conf
 total                0K

Screenshot of AMC for same node:


#10

Nishnant,

The source and interface to JEMalloc are available, as is Aerospike’s source. If you’d like to modify the source code to change the policy, it is all available to you.

If you have a specific production or use-case reason to want to free memory back to the OS, instead of applying the OS and allocator policies that are working very, very well for a majority of our production customers (JEMalloc has been a standout winner for us in terms of reducing and removing memory fragmentation problems), please let us know about that.


#11

@bbulkow My main concern is that if OS free memory is low (which doesn’t mean cluster capacity is low, because Aerospike simply hasn’t released the memory), then in case of a node failure a node might run out of memory and eventually crash. I have already faced this issue earlier, so I want to avoid it in the future.