ASD crash and memory leak with LDT bin


#1

Aerospike 3.3.8, in-memory (50GB) with disk persistence, on an HP G6 server (24 CPU threads, 72GB memory). I am running a local C client with 4 threads, with and without an LDT bin (using lmap, with 1KB sub-records).

Traffic: writes at 10K TPS + reads at 10K TPS + deletes at 10K TPS (records stay in the DB for 1 hour before being deleted).
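For a rough sense of scale, here is a minimal sketch in C that estimates how many records this workload keeps resident at steady state and how much payload that represents. It assumes every record lives exactly one hour, each record carries a single 1KB sub-record, and 1KB means 1024 bytes. The function names are hypothetical, and index and LDT overhead are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Records resident at steady state: write rate times record lifetime. */
uint64_t steady_state_records(uint64_t writes_per_sec, uint64_t lifetime_sec)
{
    /* Guard against 64-bit overflow in the multiplication. */
    assert(lifetime_sec == 0 || writes_per_sec <= UINT64_MAX / lifetime_sec);
    return writes_per_sec * lifetime_sec;
}

/* Payload bytes held by those records (index and LDT overhead ignored). */
uint64_t steady_state_bytes(uint64_t records, uint64_t bytes_per_record)
{
    assert(bytes_per_record == 0 || records <= UINT64_MAX / bytes_per_record);
    return records * bytes_per_record;
}
```

Under these assumptions, steady_state_records(10000, 3600) gives 36,000,000 records, and steady_state_bytes(36000000, 1024) gives 36,864,000,000 bytes, roughly 34.3 GiB of payload.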

Without the LDT bin there is no memory leak. With the LDT bin, calling aerospike_key_get() causes the server process asd to leak memory: its memory keeps growing until it is finally killed by the OS:

    Out of memory: Kill process 15577 (asd) score 957 or sacrifice child
    Killed process 15577, UID 0, (asd) total-vm:81134032kB, anon-rss:73154204kB, file-rss:852kB
Aug 08 2014 03:07:41 GMT: INFO (storage): (storage.c::95) waiting for storage: 20648782 objects, 24892413 scanned
Aug 08 2014 03:07:43 GMT: INFO (storage): (storage.c::95) waiting for storage: 20705274 objects, 24966302 scanned
Aug 08 2014 03:07:45 GMT: INFO (storage): (storage.c::95) waiting for storage: 20760275 objects, 25036315 scanned
Aug 08 2014 03:07:47 GMT: INFO (storage): (storage.c::95) waiting for storage: 20816040 objects, 25105305 scanned
Aug 08 2014 03:07:49 GMT: INFO (drv_ssd): (drv_ssd.c::2398) device /instances/aerospike/data/test.dat: free 68717M contig 68717M  w-q 0 w-free 65534 swb-free 0 w-tot 0
Aug 08 2014 03:07:49 GMT: INFO (storage): (storage.c::95) waiting for storage: 20867324 objects, 25166829 scanned
Aug 08 2014 03:07:51 GMT: INFO (storage): (storage.c::95) waiting for storage: 20922311 objects, 25233699 scanned
Aug 08 2014 03:07:51 GMT: INFO (drv_ssd): (drv_ssd.c::950) /instances/aerospike/data/test.dat defrag curr_pos 65535 wblocks:0 recs:0 waits:0 lock-time:1 ms total-time:1 ms
Aug 08 2014 03:07:53 GMT: INFO (storage): (storage.c::95) waiting for storage: 20979147 objects, 25305017 scanned
Aug 08 2014 03:07:55 GMT: INFO (storage): (storage.c::95) waiting for storage: 21095540 objects, 25450837 scanned
Aug 08 2014 03:07:57 GMT: INFO (storage): (storage.c::95) waiting for storage: 21220001 objects, 25600483 scanned
Aug 08 2014 03:07:59 GMT: INFO (storage): (storage.c::95) waiting for storage: 21337776 objects, 25748033 scanned
Aug 08 2014 03:08:01 GMT: INFO (storage): (storage.c::95) waiting for storage: 21458780 objects, 25899390 scanned
Aug 08 2014 03:08:02 GMT: INFO (drv_ssd): (drv_ssd.c::950) /instances/aerospike/data/test.dat defrag curr_pos 65535 wblocks:0 recs:0 waits:0 lock-time:1 ms total-time:1 ms
Aug 08 2014 03:08:03 GMT: INFO (storage): (storage.c::95) waiting for storage: 21578833 objects, 26046883 scanned
Aug 08 2014 03:08:05 GMT: INFO (storage): (storage.c::95) waiting for storage: 21702536 objects, 26204611 scanned
Aug 08 2014 03:08:07 GMT: INFO (storage): (storage.c::95) waiting for storage: 21806315 objects, 26331646 scanned
Aug 08 2014 03:08:09 GMT: INFO (namespace): (namespace.c::459) {mytest} hwm_breached true (memory), stop_writes false, memory sz:32212796650 (1401505088 + 30257526882) hwm:32212256768 sw:48318382080
Aug 08 2014 03:08:09 GMT: INFO (nsup): (thr_nsup.c::347) {mytest} cold-start building eviction histogram ...
Aug 08 2014 03:08:09 GMT: INFO (drv_ssd): (drv_ssd.c::2398) device /instances/aerospike/data/test.dat: free 68717M contig 68717M  w-q 0 w-free 65534 swb-free 0 w-tot 0
Aug 08 2014 03:08:09 GMT: INFO (storage): (storage.c::95) waiting for storage: 21898518 objects, 26447404 scanned
Aug 08 2014 03:08:11 GMT: INFO (storage): (storage.c::95) waiting for storage: 21898518 objects, 26447404 scanned
Aug 08 2014 03:08:11 GMT: WARNING (nsup): (thr_nsup.c::283) {mytest} cold-start can't evict - no records eligible
Aug 08 2014 03:08:11 GMT: INFO (nsup): (thr_nsup.c::387) {mytest} cold-start evict ttls 0,0,0.000
Aug 08 2014 03:08:13 GMT: INFO (storage): (storage.c::95) waiting for storage: 21898518 objects, 26447404 scanned
Aug 08 2014 03:08:13 GMT: INFO (drv_ssd): (drv_ssd.c::950) /instances/aerospike/data/test.dat defrag curr_pos 65535 wblocks:0 recs:0 waits:0 lock-time:1 ms total-time:1 ms
Aug 08 2014 03:08:14 GMT: INFO (nsup): (thr_nsup.c::406) {mytest} cold-start evicted 0 records, found 21898517 0-void-time records
Aug 08 2014 03:08:14 GMT: WARNING (nsup): (thr_nsup.c::410) {mytest} could not evict any records
Aug 08 2014 03:08:14 GMT: WARNING (drv_ssd): (drv_ssd.c::2681) device /instances/aerospike/data/mytest.dat: record-add halting read
Aug 08 2014 03:08:14 GMT: WARNING (drv_ssd): (drv_ssd.c::3105) disk restore: hit high water limit before disk entirely loaded.
Aug 08 2014 03:08:14 GMT: INFO (namespace): (namespace.c::459) {mytest} hwm_breached true (memory), stop_writes false, memory sz:32212796650 (1401505088 + 30257526882) hwm:32212256768 sw:48318382080
Aug 08 2014 03:08:14 GMT: INFO (nsup): (thr_nsup.c::347) {mytest} cold-start building eviction histogram ...
Aug 08 2014 03:08:15 GMT: INFO (storage): (storage.c::95) waiting for storage: 21898518 objects, 26448428 scanned
Aug 08 2014 03:08:17 GMT: WARNING (nsup): (thr_nsup.c::283) {mytest} cold-start can't evict - no records eligible
Aug 08 2014 03:08:17 GMT: INFO (nsup): (thr_nsup.c::387) {mytest} cold-start evict ttls 0,0,0.000
Aug 08 2014 03:08:17 GMT: INFO (storage): (storage.c::95) waiting for storage: 21898518 objects, 26448428 scanned
Aug 08 2014 03:08:19 GMT: INFO (storage): (storage.c::95) waiting for storage: 21898518 objects, 26448428 scanned
Aug 08 2014 03:08:19 GMT: INFO (nsup): (thr_nsup.c::406) {mytest} cold-start evicted 0 records, found 21898517 0-void-time records
Aug 08 2014 03:08:19 GMT: WARNING (nsup): (thr_nsup.c::410) {mytest} could not evict any records
Aug 08 2014 03:08:19 GMT: WARNING (drv_ssd): (drv_ssd.c::2681) device /instances/aerospike/data/mytest.dat: record-add halting read
Aug 08 2014 03:08:19 GMT: WARNING (drv_ssd): (drv_ssd.c::3105) disk restore: hit high water limit before disk entirely loaded.
Aug 08 2014 03:08:19 GMT: INFO (drv_ssd): (drv_ssd.c::3179) finished: marking blocks free: block 19453 nblocks 536851459
Aug 08 2014 03:08:19 GMT: INFO (drv_ssd): (drv_ssd.c::3247) device /instances/aerospike/data/mytest.dat: read complete: READ 26242875 (GEN 31941) (EXPIRED 0) (MAX-TTL 0) records
Aug 08 2014 03:08:19 GMT: WARNING (as): (signal.c::150) SIGSEGV received, aborting Aerospike Community Edition build 3.3.8
Aug 08 2014 03:08:20 GMT: WARNING (as): (signal.c::157) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x59) [0x46647c]
Aug 08 2014 03:08:20 GMT: WARNING (as): (signal.c::157) stacktrace: frame 1: /lib64/libc.so.6() [0x300c0329a0]
Aug 08 2014 03:08:20 GMT: WARNING (as): (signal.c::157) stacktrace: frame 2: /lib64/libc.so.6(_IO_vfprintf+0x3e5c) [0x300c04812c]
Aug 08 2014 03:08:20 GMT: WARNING (as): (signal.c::157) stacktrace: frame 3: /lib64/libc.so.6(vsnprintf+0xa2) [0x300c06fa52]
Aug 08 2014 03:08:20 GMT: WARNING (as): (signal.c::157) stacktrace: frame 4: /usr/bin/asd(cf_fault_event+0x1ac) [0x4e9d1c]
Aug 08 2014 03:08:20 GMT: WARNING (as): (signal.c::157) stacktrace: frame 5: /usr/bin/asd(ssd_load_devices_fn+0x40a) [0x4e2c2f]
Aug 08 2014 03:08:20 GMT: WARNING (as): (signal.c::157) stacktrace: frame 6: /lib64/libpthread.so.0() [0x300c8079d1]
Aug 08 2014 03:08:20 GMT: WARNING (as): (signal.c::157) stacktrace: frame 7: /lib64/libc.so.6(clone+0x6d) [0x300c0e8b6d]
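
To help read the `{mytest} hwm_breached` lines in the log above, here is a minimal sketch in C of the arithmetic behind them. It rests on assumptions, not on the server source: that a limit counts as breached when usage is strictly greater than it, that the first figure in parentheses is primary-index memory, and that each index entry takes 64 bytes. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed size of one primary-index entry, in bytes. */
#define INDEX_ENTRY_BYTES 64u

/* True when memory in use exceeds a limit (high-water mark or stop-writes). */
bool over_limit(uint64_t used_bytes, uint64_t limit_bytes)
{
    return used_bytes > limit_bytes;
}

/* Number of index entries implied by the index memory figure. */
uint64_t index_entries(uint64_t index_bytes)
{
    return index_bytes / INDEX_ENTRY_BYTES;
}
```

With the logged values, over_limit(32212796650, 32212256768) is true (hwm breached) and over_limit(32212796650, 48318382080) is false (stop_writes not reached). index_entries(1401505088) gives 21898517, the same figure the nsup line reports as 0-void-time records.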

I finally got the server back up by removing the disk persistence data file (so it started again with an empty DB).

There was no memory leak after I commented out the aerospike_key_get() calls in the C client.


#2

Aerospike released its latest build today: http://aerospike.com/download/server/3.3.12/

Would it be possible for you to re-run your test on version 3.3.12?


#3

Correction: there is still a memory leak even after I commented out the aerospike_key_get() calls. I realized that aerospike_key_remove() may also be related to the memory leak.

Question: should I use aerospike_lmap_remove() to delete all sub-records before calling aerospike_key_remove() on the root record?


#4

I’ll try with the new version, 3.3.12.


#5

Just tried with 3.3.12, but got the same results as before.


#6

A number of LDT-related patches have gone into the latest versions. Please see:

http://www.aerospike.com/download/server/notes.html#3.3.19

best,

Lucien