Memory usage keeps increasing without increase in keys

#1

I’m using aerospike build 4.5.2.1 running on Debian 9 VMs. Every record being written contains the following bins:

PK: string (always < 256 bytes)

Blob: byte array (2-4 MB)

ContentType: string (< 16 bytes)

I’m storing each record in 2 separate namespaces with different ttl (segments-disk stored on SSD with a 3 hour ttl and segments-memory stored in RAM with a 3 min ttl). This is configured as a 3 node cluster running on GCE VMs. There is a constant network ingress of 22 MBps and 16 writes per second to segments-memory, and all objects in memory are being stored with a 3 min ttl (configured in the application). As expected, I see the number of objects stay constant around 2900 (16 writes/sec * 180 sec) after the first 3 minutes of running. I configured nsup to run every 30 seconds to reclaim memory faster.

However, I am observing a slow but constant increase in memory usage of about 1.3 kB/s even though the number of keys and rate of ingest are not changing.

aerospike-memory-usage

It looks like about 80 bytes of memory is leaking per write to the segments-memory namespace. While this smells like a memory leak, I’m not sure if there is something in my configuration that is causing memory to not be freed up. Any help on this would be greatly appreciated. Pasting my aerospike.conf below.
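The figures above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch in Python that uses only the numbers quoted in this post; the variable names are illustrative.

```python
# Numbers quoted in the post above.
writes_per_sec = 16          # writes/s to segments-memory
ttl_sec = 3 * 60             # 3 minute ttl set by the application
growth_bytes_per_sec = 1300  # observed growth, about 1.3 kB/s

# Steady-state object count: every record lives for one ttl period.
expected_objects = writes_per_sec * ttl_sec

# Unexplained growth attributed to each write.
growth_per_write = growth_bytes_per_sec / writes_per_sec

print(expected_objects)         # 2880
print(round(growth_per_write))  # 81
```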

# Aerospike database configuration file.

# This stanza must come first.
service {
	user root
	group root
	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
	pidfile /var/run/aerospike/asd.pid
	proto-fd-max 15000
}

logging {

	# Log file must be an absolute path.
	file /var/log/aerospike/aerospike.log {
		context any info
	}

	# Send log messages to stdout
	console {
		context any info
	}
}

network {
	service {
		address any
		port 3000
	}

	heartbeat {
		mode mesh
		port 3002

		# use asinfo -v 'tip:host=<ADDR>;port=3002' to inform cluster of
		# other mesh nodes

		mesh-seed-address-port aerospike-server-v7-1 3002
		mesh-seed-address-port aerospike-server-v7-2 3002
		mesh-seed-address-port aerospike-server-v7-3 3002

		interval 250    # milliseconds between successive heartbeats
		timeout 20      # number of missing heartbeats after which node is declared dead
	}

	fabric {
		address any
		port 3001       # Intra-cluster communication port (migrates, replication, etc)
	}
}

namespace segments-memory {
	replication-factor 1
	memory-size 6G          # assuming 7.5G VM, increase if using a bigger VM
	default-ttl 1d          # upper bound of in-memory storage. Note: application should set a lower ttl to prevent OOM

	conflict-resolution-policy last-update-time

	storage-engine memory

	nsup-period 30
}


namespace segments-disk {
	replication-factor 2
	memory-size 1G          # assuming 7.5G VM, increase if using a bigger VM
	default-ttl 5d          # default ttl 5 days. Note: application should customize this and not rely on default value

	conflict-resolution-policy last-update-time

	#	storage-engine memory

	# To use in-memory storage, comment out the lines below and uncomment the line above

	storage-engine device {
		device /dev/sdb     # ssd path

		scheduler-mode noop     # This line optimizes for SSD
		write-block-size 8M     # this limits the maximum size of a record. recommended value for ssd is 128K but that
		                        # means we cannot store objects bigger than 128 KB

		data-in-memory false # Store data in memory in addition to file.
	}
}

Another thing I noticed was that the memory usage reported by AMC for both the namespaces combined is significantly lower than what is reported by top. AMC reports about 25% memory in use whereas top reports asd consuming about 47% memory. Is this expected?

#2
  1. How long have these nodes been running?
  2. Could you provide the output of:
    asadm -e "summary"
    asadm -e "show stat"
    

I’m not sure which stat you are looking at. I believe the stat is likely based on the configured memory-size and only accounts for namespace index and object memory. Could you provide the specific stat?

#3

Sure

  1. How long have these nodes been running?

uptime: 3 day(s) 00:55:45

$ asadm -e "summary"
Seed:        [('127.0.0.1', 3000, None)]
Config_file: /home/anirudh/.aerospike/astools.conf, /etc/aerospike/astools.conf
Cluster
=======
   1.   Server Version     :  C-4.5.2.1
   2.   OS Version         :  Debian GNU/Linux 9 (stretch) (4.9.0-8-amd64)
   3.   Cluster Size       :  3
   4.   Devices            :  Total 3, per-node 1
   5.   Memory             :  Total 21.000 GB, 12.43% used (2.610 GB), 87.57% available (18.390 GB)
   6.   Disk               :  Total 1.099 TB, 24.96% used (280.814 GB), 70.67% available contiguous space (795.000 GB)
   7.   Usage (Unique Data):  2.472 GB in-memory, 140.408 GB on-disk
   8.   Active Namespaces  :  2 of 2
   9.   Features           :  KVS, Scan
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespaces~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Namespace            Devices                       Memory                        Disk   Replication    Rack      Master           Usage           Usage   
              .   (Total,Per-Node)         (Total,Used%,Avail%)        (Total,Used%,Avail%)        Factor   Aware     Objects   (Unique-Data)   (Unique-Data)   
              .                  .                            .                           .             .       .           .       In-Memory         On-Disk   
segments-disk     (3, 1)             (3.000 GB, 1.00, 99.00)      (1.099 TB, 24.96, 70.67)              2   False   164.109 K        0.000 B       140.408 GB   
segments-memory   (0, 0)             (18.000 GB, 14.33, 85.67)    (0.000 B, 0.00, 0.00)                 1   False     2.908 K        2.472 GB        0.000 B    
Number of rows: 2

$ asadm -e "show stat"
https://pastebin.com/cBtQDRY6 (too big to paste, and attaching does not work)

I’m not sure which stat you are looking at. I believe the stat is likely based on the configured memory-size and only accounts for namespace index and object memory. Could you provide the specific stat?

See the output of asadm -e "summary" for example. This reports 12.5% memory usage whereas top reports asd consuming about 47% memory.

top - 17:51:56 up 3 days, 3 min,  1 user,  load average: 0.08, 0.12, 0.12
Tasks:  84 total,   1 running,  83 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.4 us,  1.5 sy,  0.0 ni, 93.2 id,  1.9 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  7663572 total,  2709000 free,  3957660 used,   996912 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  3359932 avail Mem 
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3554 root      20   0 8986872 3.417g   6952 S   5.0 46.8 210:31.64 asd
 3569 root      20   0  120468  68660  13340 S   3.3  0.9  14:32.28 amc
 3009 dd-agent  20   0  331836  57620  10408 S   1.7  0.8   7:21.98 python
  409 root      20   0       0      0      0 S   0.3  0.0   0:08.54 jbd2/sda1-8

I imagine this is because the 12.5% does not include indexes and the memory allocated as part of in-flight read/write requests? Is there anything else that accounts for the discrepancy?

#4

The stats agree that about 50% of the memory is being used. The primary index usage for each namespace reports:

memory_used_index_bytes   :   7052352   7047936   6972608
memory_used_index_bytes   :   70208     64384     63680

The primary index allocates memory in 1 GiB slabs; however, Linux doesn’t give the process the memory until it actually uses it. I suspect that this discrepancy occurs when the index uses a previously unused page of memory. Assuming the number of records does not increase, this should cap the growth at about 2 GiB.
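The lazy behaviour described above can be observed outside Aerospike. The following is a minimal, Linux-only sketch in Python, not Aerospike's allocator: it maps an anonymous region (32 MiB here, an arbitrary size), then reads the process's resident set size from /proc/self/statm before and after writing to each page. The helper names `resident_bytes` and `touch_demo` are made up for this illustration.

```python
import mmap

PAGE = mmap.PAGESIZE

def resident_bytes():
    """Resident set size of this process, read from /proc/self/statm (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * PAGE

def touch_demo(size=32 * 1024 * 1024):
    """Map `size` bytes, then touch every page; return RSS growth after each step."""
    start = resident_bytes()
    # Anonymous private mapping, comparable to one large allocation.
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE)
    after_map = resident_bytes() - start
    for offset in range(0, size, PAGE):
        region[offset] = 1  # first write makes the kernel back this page
    after_touch = resident_bytes() - start
    region.close()
    return after_map, after_touch

after_map, after_touch = touch_demo()
print(f"RSS growth after mapping:  {after_map / 2**20:.1f} MiB")
print(f"RSS growth after touching: {after_touch / 2**20:.1f} MiB")
```

The first figure should stay near zero and the second should be close to the mapped size, which is the same pattern top shows as the index starts using pages of a slab it reserved earlier.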

#5

Thanks. Any idea why the memory usage keeps increasing even though the network ingress and writes per second are constant and all records are being stored with a 3 min ttl in segments-memory?

aerospike-mem-usage-vs-network-ingress

#6

This is what I was trying to explain in the last paragraph. The primary index allocates a 1 GiB slab at a time. Both namespaces are currently accessing a very small fraction of this space, but any portion of it can be accessed by Aerospike. Linux will provide the process with this memory (thus updating its accounting) as the process accesses memory pages that it previously allocated but had not yet used. I hope that clarifies the last paragraph of my previous post.

#7

Thank you, I will keep monitoring this and will reply back with the results when this reaches the 2 GiB mark. Hopefully it flattens out.

#8

When you look at your namespace configuration, you are configuring 1G and 6G between the two namespaces: 7G per node, 21G total. From a configuration point of view, you are telling Aerospike that your hardware has 7G available per node. Aerospike does not check your hardware to see what you actually have. So when you add objects, it sums the memory consumed by the Primary Index (always 64 bytes per object) plus the size of the objects in memory, and reports that total as a percentage of the configured total (21 GB): about 12.5% consumed. (Your actual memory used is 2.6/3 = ~0.87 GB per node for the PI and objects, per the asadm output.) When objects in memory expire, the memory is freed right away. (OK, you have a 3 min ttl on memory and nsup at 30 sec, so that’s good.)

On the other hand, top tells you the actual memory consumed by the full asd process, so that includes memory allocated for the PI, objects, and the process itself (~1 GB), plus rounding up to the allocation chunk size. It then compares that with total system memory (what is actually on your hardware) and gives the percentage. Then, as @kporter explained, Aerospike reserves memory from hardware in chunks. In CE, I believe it reserves in 128 MiB chunks and not 1 GiB … but I may be wrong. @kporter will correct me. :slight_smile:
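To make the two denominators concrete, here is a small Python sketch that recomputes both percentages from the asadm summary and top output posted earlier in this thread. The variable names are illustrative.

```python
# asadm view: used bytes relative to the configured memory-size.
configured_gb = 3 * (6 + 1)   # 3 nodes x (6G + 1G memory-size) = 21 GB
used_gb = 2.610               # "Memory ... used" from the asadm summary
asadm_pct = 100 * used_gb / configured_gb

# top view: resident memory of asd relative to physical RAM on one node.
asd_res_kib = 3.417 * 1024 * 1024  # RES column, 3.417g
total_kib = 7663572                # "KiB Mem ... total"
top_pct = 100 * asd_res_kib / total_kib

print(f"{asadm_pct:.2f}")  # 12.43
print(f"{top_pct:.1f}")    # 46.8
```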

I still don’t see why your memory consumption should keep rising steadily. Is your ttl actually being set correctly? Do a ttl histogram dump on all the objects of both the memory and disk namespaces.

$ asinfo -v "histogram:type=ttl;namespace=segments-memory"

$ asinfo -v "histogram:type=ttl;namespace=segments-disk"

Also check nsup-period for segments-disk; by default it should be 120 sec. If it was set to zero dynamically by mistake, nsup will not clear the expired PI entries for records on disk (ttl = 3 hours), and that could explain the 1 KB/sec increase at 16 writes/sec. Finally, regarding the note in your default-ttl comments about the application setting the ttl: in general, it is best to select the correct default-ttl in the configuration and never muck with it via the application.
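The 1 KB/sec estimate follows from the per-record index size. A quick check in Python, assuming every write creates a new key and using the 64 bytes per entry and 16 writes/sec quoted in this thread:

```python
PRIMARY_INDEX_ENTRY_BYTES = 64  # per-record index size stated in this thread
writes_per_sec = 16             # new keys per second, from the original post

# Index growth if expired entries were never removed.
index_growth = writes_per_sec * PRIMARY_INDEX_ENTRY_BYTES
print(index_growth)  # 1024 bytes/s, i.e. about 1 KB/s
```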

#9

It turns out the metric for memory usage I was looking at includes cached/buffered memory. In terms of the output of free -m, it is equal to total - free (and not the used memory reported by free -m). It looks like the used memory remains more or less constant, but the cached memory keeps increasing with time.

aerospike-cached-memory-april-24
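The difference between the two metrics can be checked against the "KiB Mem" line of the top output posted earlier in this thread. A short Python sketch, with illustrative variable names:

```python
# "KiB Mem" line from the top output posted earlier in this thread.
total, free, used, buff_cache = 7663572, 2709000, 3957660, 996912

total_minus_free = total - free  # what the monitoring metric reports
print(total_minus_free)                       # 4954572
print(total_minus_free == used + buff_cache)  # True
```

So a metric defined as total minus free grows whenever buff/cache grows, even if used stays flat.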

I had to restart the aerospike cluster yesterday to test some unrelated changes so I haven’t been able to verify yet if the cached memory flattens out or ultimately causes an OOM.

It looks like the ttls are being set correctly:

$ asinfo -v "histogram:namespace=segments-memory;type=ttl"
units=seconds:hist-width=200:bucket-width=2:buckets=16,8,19,7,9,13,14,9,7,15,10,13,5,9,5,17,5,15,10,15,7,11,8,8,11,10,9,7,3,8,12,4,12,4,15,10,11,14,2,10,5,15,10,11,12,11,13,10,6,10,11,11,17,12,7,9,9,9,6,8,2,16,8,6,11,6,13,6,8,11,13,15,10,7,8,8,7,10,8,13,6,14,8,7,10,11,9,13,7,12,3,0,0,0,0,0,0,0,0,0

$ asinfo -v "histogram:namespace=segments-disk;type=ttl"
units=seconds:hist-width=10800:bucket-width=108:buckets=1140,1103,1137,1107,1073,1109,1087,1117,1091,1098,1080,1112,1100,1096,1109,1106,1098,1117,1118,1101,1094,1091,1117,1084,1097,1101,1079,1091,1081,1109,1082,1104,1066,1124,1103,1120,1127,1114,1085,1084,1096,1130,1138,1110,1112,1094,1113,1101,1057,1119,1124,1093,1091,1126,1083,1071,1095,1119,1075,1127,1091,1123,1118,1129,1087,1109,1155,1054,1107,1097,1100,1105,1062,1080,1092,1083,1127,1137,1050,1069,1086,1101,1081,1112,1118,1101,1102,1094,1115,1133,1089,1091,1103,1120,1106,1078,1106,1098,1101,1073
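Histogram strings like the ones above are easier to check with a small parser. This is a sketch in Python, assuming the colon-separated `key=value` layout shown above and assuming the first bucket starts at ttl 0; the function names and the shortened sample input are made up for illustration and are not real cluster output.

```python
def parse_ttl_histogram(text):
    """Parse 'units=...:hist-width=...:bucket-width=...:buckets=a,b,c' into a dict."""
    fields = dict(part.split("=", 1) for part in text.strip().split(":"))
    return {
        "units": fields["units"],
        "bucket_width": int(fields["bucket-width"]),
        "buckets": [int(n) for n in fields["buckets"].split(",")],
    }

def max_ttl_upper_bound(hist):
    """Upper edge of the last non-empty bucket, or 0 if all buckets are empty."""
    last = max((i for i, n in enumerate(hist["buckets"]) if n), default=-1)
    return (last + 1) * hist["bucket_width"]

# Shortened, made-up input in the same format (not real cluster output).
sample = "units=seconds:hist-width=10:bucket-width=2:buckets=3,1,4,0,0"
hist = parse_ttl_histogram(sample)
print(sum(hist["buckets"]))       # 8
print(max_ttl_upper_bound(hist))  # 6
```

Running it on the real output should show whether any record has a ttl beyond the intended 3 minutes or 3 hours.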

I verified using asinfo that the nsup-period for segments-disk is 120.

The way we’re setting the ttl in the application is to set it per record (and not by changing the default ttl for the namespace). The reason for doing this is that we’re fairly new to using aerospike for our storage needs and we’d like the ability to tune the ttl without having to redeploy the aerospike server cluster every time. Redeploying the application is fairly cheap and fast whereas we haven’t gotten to a stage yet where we can deploy the aerospike cluster as fast without loss of persistence. Is there a significant performance hit of using a per-record ttl as opposed to the default namespace ttl?

#10

You can change the default-ttl of a namespace dynamically using asinfo with set-config. You don’t have to restart the cluster. There is zero performance hit when you set the ttl from the application. But both ways are bad practice for other reasons. You can create problems if, in a record update, you reduce the ttl of a record on disk to below the remaining life of the previous copy, and then restart the node before the defrag thread can clear the older copy. This applies regardless of whether you do it from the application or by changing default-ttl dynamically.

After 3 hours and 2 minutes, your memory usage should be basically stable. (I estimate you are writing about 500 write-blocks per minute - so with default post-write-queue of 256 blocks, that memory consumption would be stable within a minute.)

I trust your application is running on different nodes than the Aerospike nodes and is not interfering with memory usage.

Also, are you inserting new records (different keys) continuously, or updating a fixed number of records (keys)? Updating a fixed set of keys would further cap the Primary Index memory usage, which is a fixed 64 bytes per key.

#11

Correct, the application is not running on these nodes.

We’re continuously inserting new records and not updating existing records. Is the primary index for expired records also cleared every nsup-period or is this a slower operation?

The cached memory graph I shared above is roughly a 29 hour snapshot, and it looks like it keeps increasing while the number of objects is constant. This is a graph of cached memory and number of objects after starting up the cluster:

The number of objects in segments-memory and segments-disk becomes constant one ttl period after starting up the cluster, but the cached memory keeps increasing. As mentioned before, I haven’t given it enough time to see whether this levels off after some time or causes an OOM.

#12

Yeah, that’s what nsup does: it clears the memory used by expired records’ PI entries, and the object memory for data-in-memory. A separate disk defrag thread finds all records not being pointed to by a PI and eventually recovers the disk space. You might want to see what happens if you disable nsup (set the period to zero) as a short experiment.

#13

You are wrong :slight_smile:. In CE, if the namespace is at least 1 GiB in size then it allocates in 1 GiB slabs.

If this were the case, they would also see the object counts increasing.

Curious, for what purpose?

#14

The system will not OOM from cached memory. If a process requests memory and the kernel is out of free memory to provide, it will take it from the cache. So this memory is almost like free memory, except that the kernel cannot access it during a hardware interrupt. We have seen memory filled by cache result in poor network performance because the kernel wasn’t able to obtain memory for the NIC during an interrupt.
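For monitoring, this means a metric that counts cache as "in use" will overstate memory pressure. The sketch below, in Python, parses /proc/meminfo-style text and compares total minus MemFree with total minus MemAvailable. The helper name `parse_meminfo` and the sample values are made up for illustration and are not taken from the cluster in this thread.

```python
def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo-style text (values are in kB)."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if rest.strip():
            info[name.strip()] = int(rest.split()[0])
    return info

# Illustrative values only, not taken from the cluster in this thread.
sample = """MemTotal:        8000000 kB
MemFree:         1000000 kB
MemAvailable:    5000000 kB
Buffers:          200000 kB
Cached:          4000000 kB"""

info = parse_meminfo(sample)
print(info["MemTotal"] - info["MemFree"])       # 7000000, includes page cache
print(info["MemTotal"] - info["MemAvailable"])  # 3000000, excludes reclaimable cache
```

On a real node the same function can be fed the contents of /proc/meminfo, for example `parse_meminfo(open("/proc/meminfo").read())`.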