Aerospike in-memory DB uses more memory than expected

Hello, I have an Aerospike database configured to keep data in memory. My cluster consists of 5 servers and has 2 namespaces.

First namespace: memory-size 4G, high-water-memory-pct 80

Second namespace: memory-size 17G, high-water-memory-pct 80

So I expect ~17 GB of RAM to be occupied.

In fact, Aerospike occupies 21.5 GB, and this is highly unexpected.

P.S. The cluster evicts data per the HWM policy, because capacity is not enough to match the configured TTL.

Have you read Capacity Planning Guide | Aerospike Documentation?

Also a user, @GeertJohan, has contributed a tool for capacity planning discussed here:

Thank you for the reply. Will check it.

I forgot to mention: if I stop the instance and start it back, memory usage drops to 19 GB.

Also, based on the info output, data memory usage is: ns1: 3.189 GB, ns2: 12.737 GB.

So I guess the extra 2.5 GB is taken by “dead” records after they were evicted. The question is how to force Aerospike to release this space without a stop/start of the database.
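For reference, a rough way to compare that per-namespace accounting against what the asd process actually holds (a sketch; exact statistic names can vary a little between server versions):

$ asinfo -v 'namespace/ns1' -l | grep memory_used   # per-namespace index/sindex/data accounting
$ asinfo -v 'statistics' -l | grep heap_            # allocator view: heap_allocated vs heap_mapped
$ ps -o rss= -C asd                                 # resident set size of the asd process, in KiB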

Can you share your namespace configuration stanzas?

namespace ns1 {

    replication-factor 2
    memory-size 4G
    high-water-memory-pct 80
    stop-writes-pct 99
    default-ttl 2d

    migrate-sleep 0
    evict-tenths-pct 10
    evict-hist-buckets 50000

    storage-engine memory
}

namespace ns2 {

    replication-factor 2
    memory-size 17G
    high-water-memory-pct 75
    stop-writes-pct 99
    default-ttl 10d

    migrate-sleep 0
    evict-tenths-pct 10
    evict-hist-buckets 50000

    storage-engine memory
}

Did you build any secondary indexes? For data-in-memory, evicted records immediately release memory back to the process (for CE, the primary index is also stored in RAM). Total RAM consumed by the Aerospike process is roughly: ~1 GB for the process itself, plus RAM for the PI (64 bytes * number of records on that node, master or replica), plus RAM for the data including the per-record overhead explained in the Linux capacity planning page (calculate RAM for the whole cluster, then divide by the number of nodes for per-node usage), plus additional RAM for any optional secondary indexes. You can see all the SIs that you have in AQL:

$ aql

aql> show indexes

For calculating memory consumed by SIs, you will need the cardinality and the number of records indexed by each SI. You can get that (for example, if the namespace is ns1 and the index name is my_index1) with:

$ asinfo -v 'sindex/ns1/my_index1'

In the output, keys = bin cardinality and entries = number of records indexed.

So look at your number of records, your replication factor (2 in your case), the size of the data in your records and the number of bins per record, calculate the full RAM usage for the cluster, divide by the number of nodes (assuming identically sized namespaces on all nodes), and you get per-node usage.
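As a back-of-the-envelope sketch of that arithmetic (the record count, average record size and node count below are made-up placeholders, not figures from this thread):

# hypothetical inputs -- substitute your own numbers
RECORDS=100000000        # master records in the cluster
RF=2                     # replication factor
AVG_RECORD_BYTES=500     # average in-memory record size, including per-record overhead
NODES=5                  # nodes in the cluster

PI_BYTES=$(( RECORDS * RF * 64 ))                  # primary index: 64 bytes per record copy
DATA_BYTES=$(( RECORDS * RF * AVG_RECORD_BYTES ))  # data kept in memory
TOTAL=$(( PI_BYTES + DATA_BYTES ))

echo "cluster RAM : $(( TOTAL / 1024**3 )) GB (plus secondary indexes)"
echo "per node    : $(( TOTAL / NODES / 1024**3 )) GB (plus ~1 GB for the asd process)"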

@pgupta When we say ns1 memory-size 4G, are we only setting this 4 GB for data, or does it include the primary and secondary indexes as well?

As per this document it includes indexes as well; in that case we should not go above the sum of all namespace memory-size values plus 1 GB for Aerospike, right?

memory-size 4G covers all memory consumed by the PI, SI, and data if it is stored in memory (storage-engine memory, or storage-engine device with data-in-memory true). Memory used by the PI is allocated in index-stage-size chunks (1 GB default) and, once allocated, is not given back to the system until you cold-start the node.

Memory used by sprigs is relatively small at the default setting (~4 MB per node for an 8-node RF=2 cluster), and it does not count towards this number.

This number is used to calculate the high-water mark for evictions (if records have a finite TTL) and stop-writes. In the above example, it is assumed that the system actually has 4 GB available for the Aerospike process; Aerospike will not validate that config value against total system memory at startup.

In addition, account for the replication factor and over-provision for at least one node out of the cluster for doing a rolling upgrade.
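To make those knobs concrete, a minimal annotated stanza (an illustration only, not recommended values; index-stage-size is shown at its 1 GB default):

namespace example {

    replication-factor 2
    memory-size 4G               # budget for PI + SI + in-memory data on this node
    high-water-memory-pct 80     # evictions of finite-TTL records start here
    stop-writes-pct 90           # client writes are refused above this
    index-stage-size 1G          # PI arena chunk size; chunks are only returned on cold start
    default-ttl 2d

    storage-engine memory
}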

Hi @pgupta, I am still confused. What we are saying is that on an in-memory node, the memory used by Aerospike is the sum of all namespace sizes + memory used by sprigs + 1 GB. What is not accounted for here?

From my perspective, if we are using an in-memory cluster, we would want to provision as much memory as possible for our namespaces.

If you have multiple namespaces, compute each one and then sum for the cluster. The memory-size setting in the namespace config is used by the server to trigger evictions and stop-writes.

And, as you note, the memory used by the Aerospike process (~1 GB per node) is a good estimate, including the default sprig (256) memory. However, remember that the minimum RAM allocated for the PI (arena stages) comes in index-stage-size chunks, 1 GB by default, and each namespace has its own separate arena stages.

Finally, if you do your node sizing with a 3-node cluster as per the above slides, then deploy a 4-node cluster; the 4th node gives extra headroom for rolling upgrades.

I would suggest you review the sizing module video recording of Introduction to Aerospike course in Aerospike Academy. Even CE users get access to the Intro course.

Side note: for SSD storage estimates, in current server versions (6.x+) there is an extra 4-byte overhead per record stored on device. Not applicable for memory storage.

Thanks @pgupta for elaborate reply.

  1. I agree on the estimation part.
  2. I agree on keeping headroom for node failure in each cluster.

But what is unclear is:

ns1: 4 GB

ns2: 4 GB

Aerospike process: 1 GB

Even if primary indexes expand in 1 GB batches, based on the above calculation the Aerospike process should not report more than 11 GB of resident memory in the Linux free -m output. And it should not go above 11 GB, especially when only 50% of the allocated 4 GB is used by the namespace (as shown in the AMC dashboard).

You are going into a lot of depth on data-size estimation, which is relevant when creating a new cluster, but my question is simpler: let’s say I have a box with 40 GB. How much of that 40 GB can I safely allocate to namespaces without running into OOM issues?

I had a cluster with 96 GB RAM, with 93 GB allocated to 2 namespaces (90 GB and 3 GB). The 3 GB namespace did not have any data and the other one had 45 GB of data. Still the node got killed by the OOM killer. I am not able to understand how much headroom I should have given. Had indexes been outside the namespace storage size I would have been extra cautious, but now I am confused.

For disk-based datastores, keeping some headroom is suggested so that pages can be cached, but when we are using the in-memory storage engine there is little value in giving more headroom than what is required. And considering memory is costlier, extra headroom is a waste of resources.

What was the highest number of master records that the cluster ever saw? I assume RF=2?

Sure, you can set the eviction HWM higher; that is a data model choice.

I am not sure of the max number of records at any point in time, but at that time there were around 90M records. And yes, RF=2.

Cluster RAM 96 GB… how many nodes? Per-node RAM? … PI for ~16 million records = 1 GB = 1 arena stage. Is 90 million the master record count? 90/16 = ~6 GB… so about 6 x RF=2 = ~12 GB tied up in arena stages. If you ever had a higher number of records on a node, that arena-stage allocation is still tied up and cannot be used for record data. So, off the bat, 45 GB of data still leaves enough headroom up to 90 GB even with the arena stages. Look through the logs and see if you can spot why you OOM’d. Any SIs? Plus, you would have hit stop-writes at 90% of 90 GB before you got OOM’d from incoming client writes, but that won’t stop replica, defrag and migration writes. What is your stop-writes-pct set at? The 90% default?
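For what it’s worth, the same arena-stage arithmetic as a small sketch (a rough cluster-wide figure; in practice stages are allocated per node and per namespace, so each node rounds up on its own):

# arena-stage arithmetic from the post above
RECORDS=90000000                      # master records in the cluster
RF=2                                  # replication factor
STAGE_RECORDS=$(( 1024**3 / 64 ))     # ~16.7M primary-index entries fit in one 1 GB stage

STAGES_PER_COPY=$(( (RECORDS + STAGE_RECORDS - 1) / STAGE_RECORDS ))   # round up
echo "~$(( STAGES_PER_COPY * RF )) GB tied up in 1 GB arena stages across the cluster"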

@pgupta Please find answers inline.

how many nodes?

8

Per node RAM?

96gb

90 million is master records?

yes

We had only set high-water-disk-pct 80, so stop-writes should be at the default.

Look through the logs and see if you can spot anything why you OOM’d

In dmesg:

[Mon Jun 27 22:11:12 2022] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[Mon Jun 27 22:11:12 2022] [10910]     0 10910  1201522    26742     152       8        0             0 java
[Mon Jun 27 22:11:12 2022] [15621]     0 15621   163807    69233     302       6        0             0 amc
[Mon Jun 27 22:11:12 2022] [27021]   999 27021    11988     2351      29       3        0             0 telemetry.py
[Mon Jun 27 22:11:12 2022] [27024]     0 27024 31509559 24094610   60202     124        0             0 asd

[Mon Jun 27 22:11:12 2022] Out of memory: Kill process 27024 (asd) score 946 or sacrifice child
[Mon Jun 27 22:11:12 2022] Killed process 27024 (asd) total-vm:126038236kB, anon-rss:96378440kB, file-rss:0kB, shmem-rss:0kB

Plus you would have hit stop-writes at 90% of 90GB before you got OOM’d due to incoming writes but that won’t stop replica, defrag and migration writes.

It never got to 90%; attaching logs from just before it died. The other 3 GB namespace is empty.

Jun 27 2022 16:45:52 GMT: INFO (info): (ticker.c:635) {ourdb} batch-sub: tsvc (0,0) proxy (36,0,12) read (2992510013,0,0,3020503)
Jun 27 2022 16:45:52 GMT: INFO (info): (ticker.c:662) {ourdb} scan: basic (90214,1620,0) aggr (0,0,0) udf-bg (0,0,0)
Jun 27 2022 16:45:52 GMT: INFO (info): (ticker.c:686) {ourdb} query: basic (55282178096,12165) aggr (0,0) udf-bg (0,0)
Jun 27 2022 16:45:52 GMT: INFO (info): (ticker.c:750) {ourdb} retransmits: migration 12208 client-read 0 client-write (0,60) client-delete (0,0) client-udf (0,0) batch-sub 0 udf-sub (0,0)
Jun 27 2022 16:45:52 GMT: INFO (info): (ticker.c:785) {ourdb} special-errors: key-busy 63 record-too-big 17918
Jun 27 2022 16:45:52 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-read (40242626222 total) msec
Jun 27 2022 16:45:52 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-write (8614964169 total) msec
Jun 27 2022 16:45:52 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query (55282190261 total) msec
Jun 27 2022 16:45:52 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query-rec-count (10195924739 total) count
Jun 27 2022 16:45:52 GMT: INFO (drv_ssd): (drv_ssd.c:2185) {ourdb} /dev/vdb: used-bytes 60448233088 free-wblocks 2008439 write-q 0 write (548457506,18.5) defrag-q 0 defrag-read (548368576,2.7) defrag-write (265090565,1.3)
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:389) {ourdb} objects: all 34125984 master 13310128 prole 20815860 non-replica 0
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:444) {ourdb} migrations: remaining (256,215,512) active (1,1,0) complete-pct 5.42
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:465) {ourdb} memory-usage: total-bytes 63183673505 index-bytes 2184062976 sindex-bytes 2204445346 data-bytes 58795165183 used-pct 65.38
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:517) {ourdb} device-usage: used-bytes 60702210544 avail-pct 95
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:585) {ourdb} client: tsvc (0,3) proxy (555,0,32) read (39500829388,0,0,741801912) write (8614945735,19783,614) delete (12454423,0,1,17672660) udf (0,0,0) lang (0,0,0,0)
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:635) {ourdb} batch-sub: tsvc (0,0) proxy (36,0,12) read (2992510767,0,0,3020503)
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:662) {ourdb} scan: basic (90214,1620,0) aggr (0,0,0) udf-bg (0,0,0)
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:686) {ourdb} query: basic (55282186649,12165) aggr (0,0) udf-bg (0,0)
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:750) {ourdb} retransmits: migration 12208 client-read 0 client-write (0,60) client-delete (0,0) client-udf (0,0) batch-sub 0 udf-sub (0,0)
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:785) {ourdb} special-errors: key-busy 63 record-too-big 17918
Jun 27 2022 16:46:02 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-read (40242631300 total) msec
Jun 27 2022 16:46:02 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-write (8614965518 total) msec
Jun 27 2022 16:46:02 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query (55282198814 total) msec
Jun 27 2022 16:46:02 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query-rec-count (10195926482 total) count
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:389) {ourdb} objects: all 34264857 master 13310151 prole 20954707 non-replica 0
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:444) {ourdb} migrations: remaining (250,208,500) active (1,1,0) complete-pct 8.03
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:465) {ourdb} memory-usage: total-bytes 63439306974 index-bytes 2192950848 sindex-bytes 2209511876 data-bytes 59036844250 used-pct 65.65
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:517) {ourdb} device-usage: used-bytes 60951649392 avail-pct 95
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:585) {ourdb} client: tsvc (0,3) proxy (555,0,32) read (39500831663,0,0,741802051) write (8614946557,19783,614) delete (12454424,0,1,17672660) udf (0,0,0) lang (0,0,0,0)
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:635) {ourdb} batch-sub: tsvc (0,0) proxy (36,0,12) read (2992511463,0,0,3020503)
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:662) {ourdb} scan: basic (90214,1620,0) aggr (0,0,0) udf-bg (0,0,0)
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:686) {ourdb} query: basic (55282192374,12165) aggr (0,0) udf-bg (0,0)
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:750) {ourdb} retransmits: migration 12208 client-read 0 client-write (0,60) client-delete (0,0) client-udf (0,0) batch-sub 0 udf-sub (0,0)
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:785) {ourdb} special-errors: key-busy 63 record-too-big 17918
Jun 27 2022 16:46:12 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-read (40242633714 total) msec
Jun 27 2022 16:46:12 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-write (8614966340 total) msec
Jun 27 2022 16:46:12 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query (55282204539 total) msec
Jun 27 2022 16:46:12 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query-rec-count (10195927460 total) count
Jun 27 2022 16:46:12 GMT: INFO (drv_ssd): (drv_ssd.c:2185) {ourdb} /dev/vdb: used-bytes 60958933328 free-wblocks 2007948 write-q 0 write (548458161,32.8) defrag-q 0 defrag-read (548368742,8.3) defrag-write (265090646,4.1)
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:389) {ourdb} objects: all 34386265 master 13310337 prole 21075928 non-replica 0
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:444) {ourdb} migrations: remaining (246,204,492) active (0,0,0) complete-pct 9.64
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:465) {ourdb} memory-usage: total-bytes 63669232448 index-bytes 2200720960 sindex-bytes 2213696484 data-bytes 59254815004 used-pct 65.89
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:517) {ourdb} device-usage: used-bytes 61176408480 avail-pct 95
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:585) {ourdb} client: tsvc (0,3) proxy (555,0,32) read (39500840040,0,0,741802370) write (8614948785,19783,614) delete (12454426,0,1,17672660) udf (0,0,0) lang (0,0,0,0)
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:635) {ourdb} batch-sub: tsvc (0,0) proxy (36,0,12) read (2992512736,0,0,3020503)
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:662) {ourdb} scan: basic (90214,1620,0) aggr (0,0,0) udf-bg (0,0,0)
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:686) {ourdb} query: basic (55282207925,12165) aggr (0,0) udf-bg (0,0)
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:750) {ourdb} retransmits: migration 12208 client-read 0 client-write (0,60) client-delete (0,0) client-udf (0,0) batch-sub 0 udf-sub (0,0)
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:785) {ourdb} special-errors: key-busy 63 record-too-big 17918
Jun 27 2022 16:46:22 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-read (40242642410 total) msec
Jun 27 2022 16:46:22 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-write (8614968568 total) msec
Jun 27 2022 16:46:22 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query (55282220090 total) msec
Jun 27 2022 16:46:22 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query-rec-count (10195929923 total) count

@pgupta I have replied with all the data points, but apparently the Akismet bot has filtered it for review; not sure if you are able to see it.

I took care of Akismet. In the meantime, would you kindly review this link about disabling transparent huge pages and see if that is affecting you. Also, please review this link on min_free_kbytes.

In our docs, please review this link on best practices.
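For a quick check of both settings on a node (standard Linux paths; on some distros the THP path is /sys/kernel/mm/redhat_transparent_hugepage instead):

$ cat /sys/kernel/mm/transparent_hugepage/enabled   # [never] is what you want on a database host
$ cat /sys/kernel/mm/transparent_hugepage/defrag
$ sysctl vm.min_free_kbytes                         # compare against the value the linked doc recommends
# disable THP until the next reboot (persist it via your boot configuration)
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled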

The two slides below summarize the issue also.

The thing with THP and other OS caches: they appear as free memory, since they can be deallocated for use, but they are critically not available at the moment the server tries to malloc memory. After the malloc fails, the kernel recognises the memory pressure. Some cached memory then gets freed up by another thread, but that is too late and we’ve already OOMed.