When running out of RAM, the DB can paint itself into a corner

gglawits · December 14, 2021, 6:18pm

I was playing around with the community edition 5.7.0.8 on a 4-core Ryzen 3000G, 16 GB RAM system running Ubuntu 20.04.

Since it is a fast DB, I found it easy to add millions of records (each containing only a key and one bin) at a pace of roughly 64,000 randomly generated records per second - awesome!

Now for the bad part: At some point, the management tool showed 9.6 GB of data in the table and 4.7 GB of RAM occupied by the index. (I had the index on my one and only bin in RAM, which seems to be the default).

Suddenly, inserting new records slowed to a crawl. I launched “top” and found out that the system was swapping heavily and that Aerospike had a RAM footprint of more than the physical RAM in the system, 16 GB. Not good.

I tried to truncate (zap) my table with a 10-line program which calls the aerospike_truncate() API, whereupon Aerospike crashed.

I restarted it with “systemctl start aerospike” and after a few minutes of rebuilding the index in RAM it ran out of RAM again and crashed again, without any interaction.

Rinse and repeat.

I wound up nuking the entire backing store with “dd if=/dev/zero of=/dev/nvme0n1 bs=65536” - an option that wouldn’t really exist had this been a production database.

I suggest this issue should be reproduced and fixed or at least mitigated.

kporter · December 14, 2021, 7:55pm

Could you share the contents of /etc/aerospike/aerospike.conf?
Could you also share the stack trace from the crash? It would appear in the server’s logs, on lines that include the words “stack” and “frame”.

gglawits · December 14, 2021, 9:37pm

# Aerospike database configuration file for use with systemd.

service {
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        proto-fd-max 15000
}

logging {
        console {
                context any info
        }
}

network {
        service {
                address 127.0.0.1
                port 3000
        }

        heartbeat {
                mode multicast
                multicast-group 239.1.99.222
                port 9918

                # To use unicast-mesh heartbeats, remove the 3 lines above, and see
                # aerospike_mesh.conf for alternative.

                interval 150
                timeout 10
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}

#namespace test {
#       replication-factor 2
#       memory-size 4G
#
#       storage-engine memory
#}

namespace test {
        replication-factor 2
        memory-size 444G

        storage-engine device {
                device /dev/nvme0n1
                scheduler-mode noop
                write-block-size 128K
        }
}

namespace bar {
        replication-factor 2
        memory-size 4G

        storage-engine memory

        # To use file storage backing, comment out the line above and use the
        # following lines instead.
#       storage-engine device {
#               file /opt/aerospike/data/bar.dat
#               filesize 16G
#               data-in-memory true # Store data in memory in addition to file.
#       }
}

gglawits · December 14, 2021, 9:42pm

As for logs, I have no idea where it logs to.

I tried to run find /var/log -type f -exec grep stack {} \; -print to no avail.

pgupta · December 14, 2021, 10:24pm

Server logs written to the console (the default) can be extracted to a file for a specific time window:

$ journalctl -u aerospike -a -o cat --since "2021-11-22 23:52:00" --until "2021-11-23 00:52:00" | grep GMT > /tmp/aerospike.20211122_23_52_00.log

You can also enable direct logging to a file in the logging stanza besides the console. But for now, you can view the log as:

$ journalctl -u aerospike -a -o cat -f

Options: -a (all output), -o cat (output as Linux cat), -f or -n 100 (like the tail command arguments)

Forcing System V type of logging to a file in a systemd distro:

logging {
# Add one or more file logging sub-context(s) ...
	file /var/log/aerospike/aerospike.log {
		context any info # or whatever context/level is desired
	}
}
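
Once the log is in a file, the stack trace lines can be picked out with a small filter. This is only a sketch: the function name `stack_trace_lines` is hypothetical, and it assumes stack trace entries carry the word "stacktrace", as they do in the server output quoted later in this thread.

```python
def stack_trace_lines(lines):
    """Keep only the lines that belong to a logged stack trace."""
    return [line.rstrip("\n") for line in lines if "stacktrace" in line]


# Two sample lines in the server's log format; in practice, pass an open
# file handle for the configured log file instead of this list.
sample = [
    "Dec 16 2021 01:03:09 GMT: CRITICAL (drv_ssd): (drv_ssd.c:2377) "
    "hit stop-writes limit before drive scan completed\n",
    "Dec 16 2021 01:03:09 GMT: WARNING (as): (log.c:629) stacktrace: "
    "frame 0: /usr/bin/asd(cf_log_stack_trace+0x116) [0x55b44f2088f3]\n",
]

for line in stack_trace_lines(sample):
    print(line)
```

With the file logging stanza above in place, the same function could be fed `open("/var/log/aerospike/aerospike.log")`.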

pgupta · December 14, 2021, 10:32pm

I see you are defining memory-size 444G but your underlying system has only 16G.

gglawits · December 15, 2021, 12:00am

Good catch - I was thinking of storage size when I set this. Thanks. Will try to add too many records again and see what happens…

gglawits · December 15, 2021, 12:45am

Setting it to 16G still incurs heavy kswapd activity. Trying 15G now…

kporter · December 15, 2021, 1:47am

There probably won’t be a trace in the logs since it was likely the kernel’s OOM killer that killed the server process.

The logs should complain about a best-practice violation since the configured memory size exceeds the system’s capacity. You can enable enforce-best-practices, which will cause the server to fail to start if anything violates our best-practices.

It isn’t unexpected that configuring a namespace’s memory-size to be greater than or equal to the system’s capacity would lead to an OOM event. The documentation recommends reserving space for other namespaces (if defined) and for memory allocations outside of the namespaces (both inside and outside the server process). See the note in the memory-size docs link above: the memory-size setting isn’t a hard limit; it is used along with the stop-writes and eviction settings to calculate the applicable thresholds.
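
To make the threshold arithmetic concrete, here is a rough sketch of how the stop-writes threshold moves with memory-size. It assumes the threshold is simply memory-size times stop-writes-pct (90 by default, as mentioned further down the thread) and that "G" in the config means GiB; both assumptions agree with the limit of 9663676416 bytes that the server logs for a 10G setting later in this thread. The function name is made up for illustration.

```python
GIB = 2**30  # assumption: "G" in aerospike.conf means GiB


def stop_writes_threshold(memory_size_bytes, stop_writes_pct=90):
    """Namespace memory usage (bytes) at which writes would be refused."""
    return memory_size_bytes * stop_writes_pct // 100


physical_ram = 16 * GIB  # the machine described in this thread

for label, size in [("444G", 444 * GIB), ("16G", 16 * GIB), ("10G", 10 * GIB)]:
    threshold = stop_writes_threshold(size)
    if threshold > physical_ram:
        verdict = "above physical RAM"
    else:
        verdict = "below physical RAM"
    print(f"memory-size {label}: stop-writes at {threshold} bytes ({verdict})")
```

With 444G the threshold sits far beyond what the machine has, so the OOM killer gets there first. With 16G the threshold is below physical RAM but leaves very little for the kernel and for allocations outside the namespace.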

gglawits · December 15, 2021, 10:35pm

Yes, it was an honest mistake.

However, when I successively set the memory-size to 16G, 15G, 14G, 13G, 12G, and 11G, during heavy load from a stress test program, kswapd still kicks in eventually.

Now, with memory-size set to 10G, the DB crashes during startup. This is from the journal:

Dec 15 2021 22:26:38 GMT: INFO (as): (as.c:379) initializing services...
Dec 15 2021 22:26:38 GMT: INFO (service): (service.c:167) starting 20 service threads
Dec 15 2021 22:26:38 GMT: INFO (fabric): (fabric.c:792) updated fabric published address list to {192.168.1.81:3001}
Dec 15 2021 22:26:38 GMT: INFO (partition): (partition_balance.c:201) {test} 4096 partitions: found 0 absent, 4096 stored
Dec 15 2021 22:26:38 GMT: INFO (smd): (smd.c:2319) no file '/opt/aerospike/smd/UDF.smd' - starting empty
Dec 15 2021 22:26:38 GMT: INFO (batch): (batch.c:781) starting 4 batch-index-threads
Dec 15 2021 22:26:38 GMT: INFO (health): (health.c:318) starting health monitor thread
Dec 15 2021 22:26:38 GMT: INFO (fabric): (fabric.c:417) starting 8 fabric send threads
Dec 15 2021 22:26:38 GMT: INFO (fabric): (fabric.c:431) starting 16 fabric rw channel recv threads
Dec 15 2021 22:26:38 GMT: INFO (fabric): (fabric.c:431) starting 4 fabric ctrl channel recv threads
Dec 15 2021 22:26:38 GMT: INFO (fabric): (fabric.c:431) starting 4 fabric bulk channel recv threads
Dec 15 2021 22:26:38 GMT: INFO (fabric): (fabric.c:431) starting 4 fabric meta channel recv threads
Dec 15 2021 22:26:38 GMT: INFO (fabric): (fabric.c:443) starting fabric accept thread
Dec 15 2021 22:26:38 GMT: INFO (hb): (hb.c:7160) initializing multicast heartbeat socket: 239.1.99.222:9918
Dec 15 2021 22:26:38 GMT: INFO (fabric): (socket.c:815) Started fabric endpoint 0.0.0.0:3001
Dec 15 2021 22:26:38 GMT: INFO (socket): (socket.c:1579) Joining multicast group: 239.1.99.222
Dec 15 2021 22:26:38 GMT: INFO (hb): (hb.c:7194) mtu of the network is 1500
Dec 15 2021 22:26:38 GMT: INFO (hb): (socket.c:1615) Started multicast heartbeat endpoint 0.0.0.0:9918
Dec 15 2021 22:26:38 GMT: INFO (nsup): (nsup.c:188) starting namespace supervisor threads
Dec 15 2021 22:26:38 GMT: INFO (service): (service.c:939) starting reaper thread
Dec 15 2021 22:26:38 GMT: INFO (service): (socket.c:815) Started client endpoint 127.0.0.1:3000
Dec 15 2021 22:26:38 GMT: INFO (service): (service.c:199) starting accept thread
Dec 15 2021 22:26:38 GMT: INFO (info-port): (thr_info_port.c:298) starting info port thread
Dec 15 2021 22:26:38 GMT: INFO (info-port): (socket.c:815) Started info endpoint 0.0.0.0:3003
Dec 15 2021 22:26:38 GMT: INFO (as): (as.c:421) service ready: soon there will be cake!
Dec 15 2021 22:26:39 GMT: INFO (nsup): (nsup.c:933) {test} collecting ttl & object size info ...
Dec 15 2021 22:26:40 GMT: INFO (clustering): (clustering.c:6354) principal node - forming new cluster with succession list: bb9d88ffec28570
Dec 15 2021 22:26:40 GMT: INFO (clustering): (clustering.c:5794) applied new cluster key 8aa4ba7e6e52
Dec 15 2021 22:26:40 GMT: INFO (clustering): (clustering.c:5796) applied new succession list bb9d88ffec28570
Dec 15 2021 22:26:40 GMT: INFO (clustering): (clustering.c:5798) applied cluster size 1
Dec 15 2021 22:26:40 GMT: INFO (exchange): (exchange.c:2318) data exchange started with cluster key 8aa4ba7e6e52
Dec 15 2021 22:26:40 GMT: INFO (exchange): (exchange.c:2668) exchange-compatibility-id: self 10 cluster-min 0 -> 10 cluster-max 0 -> 10
Dec 15 2021 22:26:40 GMT: INFO (exchange): (exchange.c:3218) received commit command from principal node bb9d88ffec28570
Dec 15 2021 22:26:40 GMT: INFO (exchange): (exchange.c:3181) data exchange completed with cluster key 8aa4ba7e6e52
Dec 15 2021 22:26:40 GMT: INFO (partition): (partition_balance.c:1005) {test} replication factor is 1
Dec 15 2021 22:26:40 GMT: INFO (partition): (partition_balance.c:976) {test} rebalanced: expected-migrations (0,0,0) fresh-partitions 0
Dec 15 2021 22:26:48 GMT: WARNING (nsup): (nsup.c:875) {test} breached stop-writes limit (memory), memory sz:13144177875 (8002539776 + 0 + 5141638099 + 0) limit:9663676416, disk avail-pct:97
Dec 15 2021 22:26:48 GMT: INFO (info): (ticker.c:166) NODE-ID bb9d88ffec28570 CLUSTER-SIZE 1
Dec 15 2021 22:26:48 GMT: INFO (info): (ticker.c:247)    cluster-clock: skew-ms 0
Dec 15 2021 22:26:48 GMT: INFO (info): (ticker.c:268)    system: total-cpu-pct 60 user-cpu-pct 40 kernel-cpu-pct 20 free-mem-kbytes 526056 free-mem-pct 3 thp-mem-kbytes 0
Dec 15 2021 22:26:48 GMT: INFO (info): (ticker.c:290)    process: cpu-pct 42 threads (8,62,62,62) heap-kbytes (15161864,15162372,15413760) heap-efficiency-pct 98.4
Dec 15 2021 22:26:48 GMT: INFO (info): (ticker.c:300)    in-progress: info-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0
Dec 15 2021 22:26:48 GMT: INFO (info): (ticker.c:323)    fds: proto (0,0,0) heartbeat (0,0,0) fabric (0,0,0)
Dec 15 2021 22:26:48 GMT: INFO (info): (ticker.c:332)    heartbeat-received: self 67 foreign 0
Dec 15 2021 22:26:48 GMT: INFO (info): (ticker.c:358)    fabric-bytes-per-second: bulk (0,0) ctrl (0,0) meta (0,0) rw (0,0)
Dec 15 2021 22:26:49 GMT: INFO (info): (ticker.c:417) {test} objects: all 125039684 master 125039684 prole 0 non-replica 0
Dec 15 2021 22:26:49 GMT: INFO (info): (ticker.c:481) {test} migrations: complete
Dec 15 2021 22:26:49 GMT: INFO (info): (ticker.c:502) {test} memory-usage: total-bytes 13144177875 index-bytes 8002539776 set-index-bytes 0 sindex-bytes 5141638099 used-pct 122.41
Dec 15 2021 22:26:49 GMT: INFO (info): (ticker.c:571) {test} device-usage: used-bytes 10003174720 avail-pct 97 cache-read-pct 0.00
Dec 15 2021 22:26:58 GMT: INFO (drv_ssd): (drv_ssd.c:1837) {test} /dev/nvme0n1: used-bytes 10003174720 free-wblocks 3586500 write-q 0 write (0,0.0) defrag-q 0 defrag-read (1,0.1) defrag-write (0,0.0)
Dec 15 2021 22:26:58 GMT: WARNING (nsup): (nsup.c:875) {test} breached stop-writes limit (memory), memory sz:13144177875 (8002539776 + 0 + 5141638099 + 0) limit:9663676416, disk avail-pct:97
Dec 15 2021 22:26:59 GMT: INFO (info): (ticker.c:166) NODE-ID bb9d88ffec28570 CLUSTER-SIZE 1
Dec 15 2021 22:26:59 GMT: INFO (info): (ticker.c:247)    cluster-clock: skew-ms 0
Dec 15 2021 22:26:59 GMT: INFO (info): (ticker.c:268)    system: total-cpu-pct 120 user-cpu-pct 101 kernel-cpu-pct 19 free-mem-kbytes 515348 free-mem-pct 3 thp-mem-kbytes 0
Dec 15 2021 22:26:59 GMT: INFO (info): (ticker.c:290)    process: cpu-pct 96 threads (8,62,62,62) heap-kbytes (15161864,15162372,15413760) heap-efficiency-pct 98.4
Dec 15 2021 22:26:59 GMT: INFO (info): (ticker.c:300)    in-progress: info-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0
Dec 15 2021 22:26:59 GMT: INFO (info): (ticker.c:323)    fds: proto (0,0,0) heartbeat (0,0,0) fabric (0,0,0)
Dec 15 2021 22:26:59 GMT: INFO (info): (ticker.c:332)    heartbeat-received: self 139 foreign 0
Dec 15 2021 22:26:59 GMT: INFO (info): (ticker.c:358)    fabric-bytes-per-second: bulk (0,0) ctrl (0,0) meta (0,0) rw (0,0)
Dec 15 2021 22:26:59 GMT: INFO (info): (ticker.c:417) {test} objects: all 125039684 master 125039684 prole 0 non-replica 0
Dec 15 2021 22:26:59 GMT: INFO (info): (ticker.c:481) {test} migrations: complete
Dec 15 2021 22:26:59 GMT: INFO (info): (ticker.c:502) {test} memory-usage: total-bytes 13144177875 index-bytes 8002539776 set-index-bytes 0 sindex-bytes 5141638099 used-pct 122.41
Dec 15 2021 22:26:59 GMT: INFO (info): (ticker.c:571) {test} device-usage: used-bytes 10003174720 avail-pct 97 cache-read-pct 0.00
aerospike.service: Main process exited, code=killed, status=9/KILL
aerospike.service: Failed with result 'signal'.
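
The ticker lines above already contain enough numbers to see why 10G cannot hold this data set. A quick back-of-the-envelope check, using only values copied from the log and assuming that memory-size 10G means 10 GiB and that stop-writes-pct is at 90:

```python
objects = 125_039_684         # "objects: all" from the ticker
index_bytes = 8_002_539_776   # "index-bytes"
sindex_bytes = 5_141_638_099  # "sindex-bytes"
total_bytes = 13_144_177_875  # "total-bytes"
memory_size = 10 * 2**30      # memory-size 10G, assuming G = GiB

index_per_object = index_bytes / objects    # primary index cost per record
sindex_per_object = sindex_bytes / objects  # secondary index cost per record
used_pct = 100 * total_bytes / memory_size  # should match "used-pct"

# Smallest memory-size that would keep this data under a 90% stop-writes limit.
min_memory_size_gib = total_bytes * 100 / 90 / 2**30

print(f"primary index:   {index_per_object:.1f} bytes per object")
print(f"secondary index: {sindex_per_object:.1f} bytes per object")
print(f"used-pct:        {used_pct:.2f}")
print(f"memory-size needed to stay under stop-writes: {min_memory_size_gib:.1f} GiB")
```

The primary index works out to exactly 64 bytes per object, and the two indexes together account for the whole total-bytes figure.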

gglawits · December 16, 2021, 1:08am

Here is a restart attempt where it actually shows a stack trace:

root@greg-desktop:/home/greg# systemctl start aerospike
root@greg-desktop:/home/greg# journalctl -u aerospike -a -o cat -f
Dec 16 2021 01:00:00 GMT: INFO (drv_ssd): (drv_ssd.c:3192) usable device size must be header size 8388608 + multiple of 131072, rounding down
Dec 16 2021 01:00:00 GMT: INFO (drv_ssd): (drv_ssd.c:3281) opened device /dev/nvme0n1: usable size 480103890944, io-min-size 512
Dec 16 2021 01:00:00 GMT: WARNING (hardware): (hardware.c:212) error while writing to file /sys/class/block/nvme0n1/queue/scheduler: 22 (Invalid argument)
Dec 16 2021 01:00:00 GMT: WARNING (hardware): (hardware.c:2627) couldn't set scheduler for /dev/nvme0n1 to noop
Dec 16 2021 01:00:00 GMT: INFO (drv_ssd): (drv_ssd.c:1048) /dev/nvme0n1 has 3662902 wblocks of size 131072
Dec 16 2021 01:00:00 GMT: INFO (drv_ssd): (drv_ssd.c:3099) {test} device /dev/nvme0n1 prior shutdown not clean
Dec 16 2021 01:00:00 GMT: INFO (drv_ssd): (drv_ssd.c:2667) device /dev/nvme0n1: reading device to load index
Dec 16 2021 01:00:05 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 6966648 device-pcts (0)
Dec 16 2021 01:00:10 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 12123101 device-pcts (0)
Dec 16 2021 01:00:15 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 16755427 device-pcts (0)
Dec 16 2021 01:00:20 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 20997530 device-pcts (0)
Dec 16 2021 01:00:25 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 24947996 device-pcts (0)
Dec 16 2021 01:00:30 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 28699887 device-pcts (0)
Dec 16 2021 01:00:35 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 32274673 device-pcts (0)
Dec 16 2021 01:00:40 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 35689250 device-pcts (0)
Dec 16 2021 01:00:45 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 39037019 device-pcts (0)
Dec 16 2021 01:00:50 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 42310304 device-pcts (0)
Dec 16 2021 01:00:55 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 45526426 device-pcts (0)
Dec 16 2021 01:01:00 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 48751588 device-pcts (0)
Dec 16 2021 01:01:05 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 51927000 device-pcts (0)
Dec 16 2021 01:01:10 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 55051274 device-pcts (0)
Dec 16 2021 01:01:15 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 58126068 device-pcts (0)
Dec 16 2021 01:01:20 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 61163007 device-pcts (1)
Dec 16 2021 01:01:25 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 64165531 device-pcts (1)
Dec 16 2021 01:01:30 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 67127637 device-pcts (1)
Dec 16 2021 01:01:35 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 70002398 device-pcts (1)
Dec 16 2021 01:01:40 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 72845651 device-pcts (1)
Dec 16 2021 01:01:45 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 75695692 device-pcts (1)
Dec 16 2021 01:01:50 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 78553594 device-pcts (1)
Dec 16 2021 01:01:55 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 81388341 device-pcts (1)
Dec 16 2021 01:02:00 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 84199378 device-pcts (1)
Dec 16 2021 01:02:05 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 86988381 device-pcts (1)
Dec 16 2021 01:02:10 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 89757486 device-pcts (1)
Dec 16 2021 01:02:15 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 92505116 device-pcts (1)
Dec 16 2021 01:02:20 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 95219150 device-pcts (1)
Dec 16 2021 01:02:25 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 97877880 device-pcts (1)
Dec 16 2021 01:02:30 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 100520942 device-pcts (1)
Dec 16 2021 01:02:35 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 103145544 device-pcts (1)
Dec 16 2021 01:02:40 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 105790369 device-pcts (1)
Dec 16 2021 01:02:45 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 108445006 device-pcts (1)
Dec 16 2021 01:02:50 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 111086340 device-pcts (1)
Dec 16 2021 01:02:55 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 113713692 device-pcts (1)
Dec 16 2021 01:03:00 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 116330018 device-pcts (1)
Dec 16 2021 01:03:05 GMT: INFO (drv_ssd): (drv_ssd.c:3681) {test} loaded: objects 118935099 device-pcts (1)
Dec 16 2021 01:03:09 GMT: WARNING (nsup): (nsup.c:875) {test} breached stop-writes limit (memory), memory sz:7730972672 (7730954240 + 0 + 18432 + 0) limit:7730941132, disk avail-pct:100
Dec 16 2021 01:03:09 GMT: WARNING (nsup): (nsup.c:1066) {test} hit stop-writes limit
Dec 16 2021 01:03:09 GMT: CRITICAL (drv_ssd): (drv_ssd.c:2377) hit stop-writes limit before drive scan completed
Dec 16 2021 01:03:09 GMT: WARNING (as): (signal.c:217) SIGUSR1 received, aborting Aerospike Community Edition build 5.7.0.8 os ubuntu20.04
Dec 16 2021 01:03:09 GMT: WARNING (as): (log.c:606) stacktrace: registers: rax 0000000000000000 rbx 00007ff1a9dbe000 rcx 00007ff1aca2524b rdx 0000000000000000 rsi 00007ff15fdfa820 rdi 0000000000000002 rbp 000000000000000a rsp 00007ff15fdfa820 r8 0000000000000000 r9 00007ff15fdfa820 r10 0000000000000008 r11 0000000000000246 r12 0000000000000536 r13 00007ff15f211240 r14 00007ff15fdfaab0 r15 00007ff1a9e00000 rip 00007ff1aca2524b
Dec 16 2021 01:03:09 GMT: WARNING (as): (log.c:618) stacktrace: found 12 frames: 0x1788f3 0xd1894 0x7ff1aca253c0 0x7ff1aca2524b 0x178289 0x14d586 0x14d75b 0x14dab1 0x169a13 0x169733 0x7ff1aca19609 0x7ff1ac5d0293 offset 0x55b44f090000
Dec 16 2021 01:03:09 GMT: WARNING (as): (log.c:629) stacktrace: frame 0: /usr/bin/asd(cf_log_stack_trace+0x116) [0x55b44f2088f3]
Dec 16 2021 01:03:09 GMT: WARNING (as): (log.c:629) stacktrace: frame 1: /usr/bin/asd(as_sig_handle_usr1+0x38) [0x55b44f161894]
Dec 16 2021 01:03:09 GMT: WARNING (as): (log.c:629) stacktrace: frame 2: /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7ff1aca253c0]
Dec 16 2021 01:03:09 GMT: WARNING (as): (log.c:629) stacktrace: frame 3: /lib/x86_64-linux-gnu/libpthread.so.0(raise+0xcb) [0x7ff1aca2524b]
Dec 16 2021 01:03:09 GMT: WARNING (as): (log.c:629) stacktrace: frame 4: /usr/bin/asd(cf_log_write_no_return+0x97) [0x55b44f208289]
Dec 16 2021 01:03:09 GMT: WARNING (as): (log.c:629) stacktrace: frame 5: /usr/bin/asd(ssd_cold_start_sweep+0) [0x55b44f1dd586]
Dec 16 2021 01:03:09 GMT: WARNING (as): (log.c:629) stacktrace: frame 6: /usr/bin/asd(ssd_cold_start_sweep+0x1d5) [0x55b44f1dd75b]
Dec 16 2021 01:03:09 GMT: WARNING (as): (log.c:629) stacktrace: frame 7: /usr/bin/asd(run_ssd_cold_start+0x91) [0x55b44f1ddab1]
Dec 16 2021 01:03:09 GMT: WARNING (as): (log.c:629) stacktrace: frame 8: /usr/bin/asd(+0x169a13) [0x55b44f1f9a13]
Dec 16 2021 01:03:09 GMT: WARNING (as): (log.c:629) stacktrace: frame 9: /usr/bin/asd(+0x169733) [0x55b44f1f9733]
Dec 16 2021 01:03:09 GMT: WARNING (as): (log.c:629) stacktrace: frame 10: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7ff1aca19609]
Dec 16 2021 01:03:09 GMT: WARNING (as): (log.c:629) stacktrace: frame 11: /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff1ac5d0293]
aerospike.service: Main process exited, code=killed, status=6/ABRT
aerospike.service: Failed with result 'signal'.

kporter · December 16, 2021, 11:58pm

Yes, this is an assert which is complaining that the configured memory size cannot support the amount of data found on the device. If the records on the device had expirations set, the coldstart would evict the records soonest to expire to make room to start up. Since your records are not expirable (and therefore not evictable), the coldstart is forced to give up.

It is a bit odd that stop-writes is being triggered around 72% utilization - the default for stop-writes-pct is 90% - have you adjusted your stop-writes-pct?
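
For reference, the 72% figure can be reproduced from the warning line in the cold-start log above. The sketch below parses that line and compares the logged limit with the configured memory-size; it assumes memory-size was still 10G (10 GiB) for that restart, which the thread does not state explicitly.

```python
import re

# Warning line copied from the cold-start log above.
line = (
    "Dec 16 2021 01:03:09 GMT: WARNING (nsup): (nsup.c:875) {test} breached "
    "stop-writes limit (memory), memory sz:7730972672 "
    "(7730954240 + 0 + 18432 + 0) limit:7730941132, disk avail-pct:100"
)

match = re.search(r"memory sz:(\d+) \(([\d +]+)\) limit:(\d+)", line)
used = int(match.group(1))
parts = [int(p) for p in match.group(2).split(" + ")]
limit = int(match.group(3))

memory_size = 10 * 2**30  # assumption: memory-size 10G, with G = GiB

print(f"limit as share of memory-size: {100 * limit / memory_size:.2f}%")
print(f"components add up to memory sz: {sum(parts) == used}")
print(f"over the limit by {used - limit} bytes")
```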

gglawits · December 17, 2021, 1:15am

No, I have not changed stop-writes-pct. The DB has roughly 125 million records at this point - not a lot compared to the planet’s population or even compared to the US population. How do I set expirations on records? I will try to google it. What I am trying to create, for laughs and kicks, is a DB of 330 million social security numbers (pretend ones, not real ones) and their corresponding randomly generated tokens.

kporter · December 17, 2021, 10:19pm

Per the Capacity Planning Guide (which I recommend reviewing), you will need about 40 GiB for the index (assuming replication-factor 2). You should also provide headroom for system memory usage and other allocations in Aerospike, as well as for potential faults in your cluster, such as losing nodes (up to your fault tolerance).
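
The 40 GiB estimate can be reproduced with simple arithmetic. This sketch assumes 64 bytes of primary index per record copy, which is what the index-bytes and objects figures in the log earlier in the thread work out to, and it covers the primary index only; any secondary index comes on top.

```python
INDEX_ENTRY_BYTES = 64   # primary index bytes per record copy (see log above)
records = 330_000_000    # target number of records from the previous post
replication_factor = 2   # as assumed in the estimate above

# Total primary index memory across the whole cluster.
index_bytes = records * INDEX_ENTRY_BYTES * replication_factor
index_gib = index_bytes / 2**30

print(f"primary index: {index_bytes} bytes, about {index_gib:.1f} GiB")
```

That total is spread across the nodes of the cluster, so a single 16 GB machine would not be able to hold it even with replication-factor 1.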

system · December 17, 2022, 10:19pm

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.
