Aerospike server is getting killed because of "out of memory". Math not working out


#1

Hi, I am running Aerospike Server Community Edition 3.5.9 in a 2-node mesh cluster with in-memory data storage. I have started observing a weird issue where the Aerospike server starts consuming memory and eventually gets killed because of “out of memory”. The memory utilization of the namespaces doesn’t add up at all: I have 4 namespaces, and together they are not even consuming 10 MB of memory, whereas the “asd” process is taking close to 4 GB (which is my entire system memory).

Following are my observations:

  1. Though both nodes are in cluster mode, only one of the servers starts spiking in memory and finally gets killed by the OOM killer.
  2. After this server gets killed, the second server follows the same pattern.
  3. asmonitor confirms successful formation of the cluster and shows very little memory utilization by namespace objects.
  4. The entire system memory is consumed within a minute.
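For reference, the namespace figures from asmonitor can also be cross-checked against what the OS sees for the asd process. A rough sketch of how I compare the two (assuming the asinfo tool that ships with Aerospike is on the PATH and the node answers on the default port; `used-bytes-memory` is the per-namespace stat name in the 3.x line):

```shell
# Cross-check server-reported namespace memory against the OS view of asd.
# Namespace names below are the four from my config.
if command -v asinfo >/dev/null 2>&1; then
  for ns in test bar statistics debug_statistics; do
    echo "namespace $ns:"
    asinfo -v "namespace/$ns" | tr ';' '\n' | grep -E '^used-bytes-memory='
  done
else
  echo "asinfo not installed on this host"
fi
# OS view of the same process (VSZ/RSS in kB); prints nothing if asd is not running.
ps axo pid=,vsz=,rss=,comm= | awk '$4 == "asd"'
```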

Pasting log entries:

Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4744)  system memory: free 3213928kb ( 79 percent free )
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4752)  migrates in progress ( 0 , 0 ) ::: ClusterSize 2 ::: objects 2551 ::: sub_objects 0
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4760)  rec refs 2551 ::: rec locks 0 ::: trees 0 ::: wr reqs 5 ::: mig tx 0 ::: mig rx 0
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4766)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4776)    trans_in_progress: wr 1 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (177, 348, 171) : hb (2, 2, 0) : fab (29, 29, 0)
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4778)    heartbeat_received: self 0 : foreign 865
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4779)    heartbeat_stats: bt 0 bf 821 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4792)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 1
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4808) namespace test: disk inuse: 0 memory inuse: 1033586 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4808) namespace bar: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4808) namespace statistics: disk inuse: 0 memory inuse: 30667 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4808) namespace debug_statistics: disk inuse: 0 memory inuse: 92142 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4853)    partitions: actual 8236 sync 8148 desync 0 zombie 0 wait 0 absent 0


Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4744)  system memory: free 1695028kb ( 41 percent free )
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4752)  migrates in progress ( 0 , 0 ) ::: ClusterSize 2 ::: objects 4459 ::: sub_objects 0
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4760)  rec refs 4459 ::: rec locks 0 ::: trees 0 ::: wr reqs 1 ::: mig tx 0 ::: mig rx 0
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4766)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4776)    trans_in_progress: wr 1 prox 0 wait 3 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (177, 359, 182) : hb (2, 2, 0) : fab (29, 29, 0)
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4778)    heartbeat_received: self 0 : foreign 999
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4779)    heartbeat_stats: bt 0 bf 951 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4792)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 1
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4808) namespace test: disk inuse: 0 memory inuse: 1853496 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4808) namespace bar: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4808) namespace statistics: disk inuse: 0 memory inuse: 33729 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4808) namespace debug_statistics: disk inuse: 0 memory inuse: 99694 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4853)    partitions: actual 8236 sync 8148 desync 0 zombie 0 wait 0 absent 0

It is very evident from the logs that at 11:42:34 about 3.2 GB of system memory was free and total memory usage across all namespaces was under 2 MB. By 11:42:44 only 1.6 GB was free, meaning 1.6 GB was consumed within 10 seconds (and I can confirm from top that the culprit is asd), yet memory consumed by the namespaces was still barely over 2 MB. Logs from dmesg:

    [2434872.262187] Out of memory: Kill process 21698 (asd) score 922 or sacrifice child
    [2434872.264792] Killed process 21698 (asd) total-vm:9665960kB, anon-rss:3841324kB, file-rss:0kB

This has just started happening; previously the entire setup worked perfectly. Can someone explain what’s happening here?


#2

Yes, logs do show that in 10 seconds system memory dropped from 79% free to 41% free.

Can you confirm the size of your data?

What operations are you performing on the cluster? Scan, batch?

When you say this setup was working previously, can you confirm if it was on a different version or the same one?

Do you have anything else running on these machines?

Are these in AWS or bare metal?

Do you have the flexibility of upgrading to a newer version (which includes a multitude of fixes and better logging of statistics) to confirm whether the issue persists?
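Also, the next time the memory starts climbing, it would help to capture free system memory and the asd resident set side by side so we can see the growth rate between the 10-second log ticks. A minimal sketch, assuming a Linux host (sample count and interval are arbitrary):

```shell
# Sample free memory and the asd resident set every 2 seconds.
# MemFree from /proc/meminfo matches the "system memory: free" figure in the
# Aerospike log; RSS matches the anon-rss the OOM killer reported in dmesg.
for i in 1 2 3; do
  free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
  rss_kb=$(ps axo rss=,comm= | awk '$2 == "asd" {s += $1} END {print s + 0}')
  echo "$(date -u '+%H:%M:%S') free=${free_kb}kB asd_rss=${rss_kb}kB"
  sleep 2
done
```

If the RSS grows while the namespace `memory inuse` figures stay flat, the allocation is happening outside the namespace accounting, which narrows down where to look.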


#3

I don’t see this issue any more on 3.5.9 itself; it disappeared on its own. One suspect is a bug on the application side in the UDF arguments, which might have exposed a leak at the Aerospike server level.

I can confirm that nothing else was running on the system and that the entire memory was consumed by the Aerospike server only.