Aerospike server is getting killed because of "out of memory". Math not working out


#1

Hi, I am running Aerospike Server Community Edition 3.5.9 in a 2-node mesh cluster with in-memory data storage. I have started observing a weird issue where the Aerospike server starts consuming memory and eventually gets killed because of “out of memory”. The memory utilization of the namespaces doesn’t add up at all: I have 4 namespaces, and together they are not even consuming 10 MB of memory, whereas the “asd” process is taking close to 4 GB (which is my entire system memory).

Following are my observations:

  1. Though both nodes are in cluster mode, only one of the servers starts spiking in memory and finally gets killed by the OOM killer.
  2. After this server gets killed, the second server follows the same pattern.
  3. asmonitor confirms successful formation of the cluster and shows very little memory utilization by namespace objects.
  4. The entire system memory is consumed within a minute.
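For reference, the namespace figures from asmonitor can also be cross-checked against what the OS sees for the asd process. A rough sketch of how I compare the two (assuming the asinfo tool that ships with Aerospike is on the PATH and the node answers on the default port; `used-bytes-memory` is the per-namespace stat name in the 3.x line):

```shell
# Cross-check server-reported namespace memory against the OS view of asd.
# Namespace names below are the four from my config.
if command -v asinfo >/dev/null 2>&1; then
  for ns in test bar statistics debug_statistics; do
    echo "namespace $ns:"
    asinfo -v "namespace/$ns" | tr ';' '\n' | grep -E '^used-bytes-memory='
  done
else
  echo "asinfo not installed on this host"
fi
# OS view of the same process (VSZ/RSS in kB); prints nothing if asd is not running.
ps axo pid=,vsz=,rss=,comm= | awk '$4 == "asd"'
```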

Pasting log entries:

Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4744)  system memory: free 3213928kb ( 79 percent free )
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4752)  migrates in progress ( 0 , 0 ) ::: ClusterSize 2 ::: objects 2551 ::: sub_objects 0
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4760)  rec refs 2551 ::: rec locks 0 ::: trees 0 ::: wr reqs 5 ::: mig tx 0 ::: mig rx 0
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4766)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4776)    trans_in_progress: wr 1 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (177, 348, 171) : hb (2, 2, 0) : fab (29, 29, 0)
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4778)    heartbeat_received: self 0 : foreign 865
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4779)    heartbeat_stats: bt 0 bf 821 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4792)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 1
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4808) namespace test: disk inuse: 0 memory inuse: 1033586 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4808) namespace bar: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4808) namespace statistics: disk inuse: 0 memory inuse: 30667 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4808) namespace debug_statistics: disk inuse: 0 memory inuse: 92142 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:34 GMT: INFO (info): (thr_info.c::4853)    partitions: actual 8236 sync 8148 desync 0 zombie 0 wait 0 absent 0


Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4744)  system memory: free 1695028kb ( 41 percent free )
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4752)  migrates in progress ( 0 , 0 ) ::: ClusterSize 2 ::: objects 4459 ::: sub_objects 0
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4760)  rec refs 4459 ::: rec locks 0 ::: trees 0 ::: wr reqs 1 ::: mig tx 0 ::: mig rx 0
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4766)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4776)    trans_in_progress: wr 1 prox 0 wait 3 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (177, 359, 182) : hb (2, 2, 0) : fab (29, 29, 0)
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4778)    heartbeat_received: self 0 : foreign 999
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4779)    heartbeat_stats: bt 0 bf 951 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4792)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 1
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4808) namespace test: disk inuse: 0 memory inuse: 1853496 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4808) namespace bar: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4808) namespace statistics: disk inuse: 0 memory inuse: 33729 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4808) namespace debug_statistics: disk inuse: 0 memory inuse: 99694 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Jan 21 2016 11:42:44 GMT: INFO (info): (thr_info.c::4853)    partitions: actual 8236 sync 8148 desync 0 zombie 0 wait 0 absent 0

It is very evident from the logs that at 11:42:34 about 3.2 GB of system memory was free and total memory usage across all namespaces was under 2 MB. By 11:42:44 only 1.6 GB was free, meaning 1.6 GB was consumed within 10 seconds (and I can confirm from top that the culprit is asd), yet memory consumed by the namespaces was still barely over 2 MB. Logs from dmesg:

    [2434872.262187] Out of memory: Kill process 21698 (asd) score 922 or sacrifice child
    [2434872.264792] Killed process 21698 (asd) total-vm:9665960kB, anon-rss:3841324kB, file-rss:0kB

This has just started happening; previously the entire setup worked perfectly. Can someone explain what’s happening here?


#2

Yes, logs do show that in 10 seconds system memory dropped from 79% free to 41% free.

Can you confirm the size of your data?

What operations are you performing on the cluster? Scan, batch?

When you say this setup was working previously, can you confirm if it was on a different version or the same one?

Do you have anything else running on these machines?

Are these in AWS or bare metal?

Do you have the flexibility of upgrading to a newer version (which includes a multitude of fixes and better logging of statistics) to confirm whether the issue persists?
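Also, the next time the memory starts climbing, it would help to capture free system memory and the asd resident set side by side so we can see the growth rate between the 10-second log ticks. A minimal sketch, assuming a Linux host (sample count and interval are arbitrary):

```shell
# Sample free memory and the asd resident set every 2 seconds.
# MemFree from /proc/meminfo matches the "system memory: free" figure in the
# Aerospike log; RSS matches the anon-rss the OOM killer reported in dmesg.
for i in 1 2 3; do
  free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
  rss_kb=$(ps axo rss=,comm= | awk '$2 == "asd" {s += $1} END {print s + 0}')
  echo "$(date -u '+%H:%M:%S') free=${free_kb}kB asd_rss=${rss_kb}kB"
  sleep 2
done
```

If the RSS grows while the namespace `memory inuse` figures stay flat, the allocation is happening outside the namespace accounting, which narrows down where to look.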


#3

I don’t see this issue any more on 3.5.9 itself; it disappeared on its own. One suspect is a bug on the application side in the UDF arguments, which might have exposed a leak at the Aerospike server level.

I can confirm that nothing else was running on the system and that the entire memory was consumed by the Aerospike server only.