SIGSEGV received, aborting (3.5.4 CE) running Lua


#1

We have a strange problem with Aerospike CE after we started using our Lua UDF for data storage. The server runs under our repeated test load for up to 1-1.5 hours without problems, then suddenly stops with SIGSEGV. Before we started running Lua, we had no problems.

We are using basic structures in bins (MAP, LIST), not LDTs (large data types). The test load generator repeatedly stores basic user profile data. We also tried fully cleaning the SSD storage with “dd” before starting Aerospike, and in the previous test we removed all records from all sets.

We have already disabled Lua code caching in the config.
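For reference, disabling the Lua cache on the 3.x server is typically done with a `mod-lua` stanza in the config file (a sketch of that stanza; our actual config may differ):

```
mod-lua {
        cache-enabled false
}
```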

Data is stored on a raw device (Intel SSD 3700) running on vSphere for DEV. Environment: Linux lju1082 3.13.0-46-generic #77~precise1-Ubuntu SMP Mon Mar 2 23:24:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux. Client: latest Java Aerospike client (from Maven).

Error in aerospike.log:

Apr 01 2015 15:48:06 GMT: INFO (udf): ([C]::-1) map_utils.append (imps.history, 26, 1427903286) => start
Apr 01 2015 15:48:06 GMT: INFO (udf): ([C]::-1) map_utils.append (imps.history, 26, 1427903286) => appended value to list
Apr 01 2015 15:48:06 GMT: INFO (udf): ([C]::-1) map_utils.append (imps.history, 26, 1427903286) => touched to insure update
Apr 01 2015 15:48:06 GMT: INFO (udf): ([C]::-1) map_utils.append (imps.history, 26, 1427903286) => record updated
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::160) SIGSEGV received, aborting Aerospike Community Edition build 3.5.4
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x54) [0x46bad1]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 1: /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f4d4efac4a0]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 2: /usr/bin/asd(as_particle_frombuf+0x11a) [0x4635ed]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 3: /usr/bin/asd() [0x4e5881]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 4: /usr/bin/asd(as_storage_particle_read_all_ssd+0xca) [0x4e7796]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 5: /usr/bin/asd(as_bin_get_all+0x7d) [0x44c679]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 6: /usr/bin/asd(udf_storage_record_open+0xb9) [0x4b9c3b]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 7: /usr/bin/asd(udf_rw_local+0x39e) [0x4bd5be]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 8: /usr/bin/asd() [0x4a758b]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 9: /usr/bin/asd(as_rw_start+0x298) [0x4a91c1]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 10: /usr/bin/asd(process_transaction+0xe61) [0x4b3e4b]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 11: /usr/bin/asd(thr_tsvc+0x2e) [0x4b427d]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 12: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f4d4fe72e9a]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 13: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f4d4f06a3fd]

Our UDF (map_utils.lua):

--Basic data structure:
--Record (userKey-cookieId) -> Bin("imps.history") -> Map(campaignId) -> List([epochtime, ...])

function append(r, bin, key, value)
    local prefix = "map_utils.append ("..tostring(bin)..", "..tostring(key)..", "..tostring(value)..") => "
    info(prefix.."start")

    --create the map bin if it does not exist
    local bv = r[bin]
    if bv == nil then
        info(prefix.."new map created (new user)")
        bv = map()
    end

    --create a list in the value of the map if it does not exist
    if bv[key] == nil then
        info(prefix.."new list (first time seen campaign)")
        bv[key] = list()
    end

    --append the new value to the end of the list
    list.append(bv[key], value)
    info(prefix.."appended value to list")

    --write the bin back to ensure the update is applied
    r[bin] = bv
    info(prefix.."touched to insure update")

    --create the record if it does not exist, otherwise update it
    if aerospike:exists(r) then
        aerospike:update(r)
        info(prefix.."record updated")
    else
        aerospike:create(r)
        info(prefix.."record created")
    end

    --return the list size
    return list.size(bv[key])
end
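For readers unfamiliar with the record API, the UDF logic above boils down to a nested map-of-lists append. Here is a plain-Python sketch of the same logic (this only simulates the data manipulation with dicts and lists; it does not use the Aerospike API):

```python
def append(record, bin_name, key, value):
    """Simulate map_utils.append: record is a dict of bins,
    each bin holds a map of key -> list of values."""
    bv = record.get(bin_name)
    if bv is None:          # create the map bin if absent (new user)
        bv = {}
    if key not in bv:       # create the list if absent (first-seen campaign)
        bv[key] = []
    bv[key].append(value)   # append the new value to the end of the list
    record[bin_name] = bv   # write the bin back ("touch" to apply the update)
    return len(bv[key])     # return the list size

r = {}
assert append(r, "imps.history", 26, 1427903286) == 1
assert append(r, "imps.history", 26, 1427903300) == 2
```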

Our config file:

service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        service-threads 8
        transaction-queues 8
        transaction-threads-per-queue 8
        proto-fd-max 15000
}

logging {
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode multicast
                address 239.1.99.82
                port 9918
                interval 150
                timeout 10
        }

        fabric {
                port 3001
        }
 
        info {
                port 3003
        }
}

namespace dsp1 {
        replication-factor 1
        memory-size 1G
        default-ttl 30d # 30 days, use 0 to never expire/evict.

        storage-engine device {
                # Use the 2 lines below with actual device paths.
                device /dev/sdb

                # The 2 lines below optimize for SSD.
                scheduler-mode noop
                write-block-size 128K

                # Use the line below to store data in memory in addition to devices.
#               data-in-memory true
        }
}

Some basic info from aerospike.log (in case this could influence the problem):

Apr 01 2015 15:48:19 GMT: INFO (config): (cfg.c::3033) system file descriptor limit: 100000, proto-fd-max: 15000
Apr 01 2015 15:48:19 GMT: INFO (cf:misc): (id.c::119) Node ip: 192.168.1.82
Apr 01 2015 15:48:19 GMT: INFO (config): (cfg.c::3089) Rack Aware mode not enabled
Apr 01 2015 15:48:19 GMT: INFO (config): (cfg.c::3092) Node id bb938caa7290c00
Apr 01 2015 15:48:19 GMT: INFO (namespace): (namespace_cold.c::101) ns dsp1 beginning COLD start
Apr 01 2015 15:48:19 GMT: INFO (drv_ssd): (drv_ssd.c::3881) opened device /dev/sdb: usable size 10737418240, io-min-size 512
Apr 01 2015 15:48:19 GMT: INFO (drv_ssd): (drv_ssd.c::3741) storage: set device /dev/sdb scheduler mode to noop
Apr 01 2015 15:48:19 GMT: INFO (drv_ssd): (drv_ssd.c::1032) /dev/sdb has 81920 wblocks of size 131072
Apr 01 2015 15:48:19 GMT: INFO (drv_ssd): (drv_ssd.c::3421) device /dev/sdb: reading device to load index
Apr 01 2015 15:48:19 GMT: INFO (drv_ssd): (drv_ssd.c::3426) In TID 11786: Using arena #150 for loading data for namespace "dsp1"
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::3447) device /dev/sdb: read complete: UNIQUE 2733 (REPLACED 1242) (GEN 619) (EXPIRED 0) (MAX-TTL 0) records
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::997) ns dsp1 loading free & defrag queues
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::931) /dev/sdb init defrag profile: 0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,2,2,0,0,0,0,0,2,8,6,3,0,0,0,2,4,9,5,2,0,0,0,3,8,18,10,2,2,0,1,5,18,15,11,2
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::1021) /dev/sdb init wblock free-q 81594, defrag-q 142
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::2585) ns dsp1 starting device maintenance threads
Apr 01 2015 15:48:20 GMT: INFO (as): (as.c::410) initializing services...
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::1703) ns dsp1 starting write worker threads
Apr 01 2015 15:48:20 GMT: INFO (tsvc): (thr_tsvc.c::1000) shared queues: 8 queues with 8 threads each
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::848) ns dsp1 starting defrag threads
Apr 01 2015 15:48:20 GMT: INFO (hb): (hb.c::2442) heartbeat socket initialization
Apr 01 2015 15:48:20 GMT: INFO (hb): (hb.c::2446) initializing multicast heartbeat socket : 239.1.99.82:9918
Apr 01 2015 15:48:20 GMT: INFO (paxos): (paxos.c::3139) partitions from storage: total 4096 found 4096 lost(set) 0 lost(unset) 0
Apr 01 2015 15:48:20 GMT: INFO (config): (cluster_config.c::406) Rack Aware is disabled.
Apr 01 2015 15:48:20 GMT: INFO (partition): (cluster_config.c::368) Rack Aware is disabled.
Apr 01 2015 15:48:20 GMT: INFO (partition): (partition.c::2810) CLUSTER SIZE = 1
Apr 01 2015 15:48:20 GMT: INFO (partition): (partition.c::2849) Global state is well formed
Apr 01 2015 15:48:20 GMT: INFO (paxos): (partition.c::2503) setting replication factors: cluster size 1, paxos single replica limit 1
Apr 01 2015 15:48:20 GMT: INFO (paxos): (partition.c::2510) {dsp1} replication factor is 1
Apr 01 2015 15:48:20 GMT: INFO (paxos): (partition.c::3755) global partition state: total 4096 lost 0 unique 4096 duplicate 0
Apr 01 2015 15:48:20 GMT: INFO (paxos): (partition.c::3756) partition state after fixing lost partitions (master): total 4096 lost 0 unique 4096 duplicate 0
Apr 01 2015 15:48:20 GMT: INFO (paxos): (partition.c::3757) 0 new partition version tree paths generated
Apr 01 2015 15:48:20 GMT: INFO (partition): (partition.c::364) ALLOW MIGRATIONS
Apr 01 2015 15:48:20 GMT: INFO (paxos): (paxos.c::3143) Paxos service ignited: bb938caa7290c00
Apr 01 2015 15:48:20 GMT: INFO (scan): (thr_tscan.c::2081) started 32 threads
Apr 01 2015 15:48:20 GMT: INFO (batch): (thr_batch.c::342) Initialize 4 batch worker threads.
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::4316) {dsp1} floor set at 89 wblocks per device
Apr 01 2015 15:48:23 GMT: INFO (paxos): (paxos.c::3205) paxos supervisor thread started
Apr 01 2015 15:48:23 GMT: INFO (nsup): (thr_nsup.c::1196) namespace supervisor started
Apr 01 2015 15:48:23 GMT: INFO (ldt): (thr_nsup.c::1153) LDT supervisor started
Apr 01 2015 15:48:23 GMT: INFO (demarshal): (thr_demarshal.c::221) Saved original JEMalloc arena #7 for thr_demarshal()
Apr 01 2015 15:48:23 GMT: INFO (demarshal): (thr_demarshal.c::249) Service started: socket 3000

#2

We currently think this starts when a record has more than 256 bins. We also see this bug only on one specific server.

Is there any limit on the number of bins in one record when running Lua?


#3

I also see a SIGSEGV when running a test. The Aerospike DB was set up on a very small server because I was testing my application server. I am using a UDF to access an LDT bin containing a large list. I am sure the number of bins in the record is very small. Any updates on this issue? Thanks.

Apr 09 2015 00:21:40 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Apr 09 2015 00:21:40 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Apr 09 2015 00:21:52 GMT: INFO (info): (thr_info.c::4580)  system memory: free 119308kb ( 11 percent free ) 
Apr 09 2015 00:21:51 GMT: INFO (drv_ssd): (drv_ssd.c::2398) device /opt/aerospike/data/moo.dat: used 37327232, contig-free 1790M (1790 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 1232 (0.1/s), defrag-q 0 defrag-tot 958 (0.1/s)
Apr 09 2015 00:21:54 GMT: INFO (info): (thr_info.c::4588)  migrates in progress ( 0 , 0 ) ::: ClusterSize 1 ::: objects 201813 ::: sub_objects 21
Apr 09 2015 00:22:03 GMT: INFO (info): (thr_info.c::4596)  rec refs 201834 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Apr 09 2015 00:22:04 GMT: INFO (info): (thr_info.c::4602)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0 
Apr 09 2015 00:22:06 GMT: INFO (info): (thr_info.c::4612)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (96, 5462, 5366) : hb (0, 0, 0) : fab (16, 16, 0)
Apr 09 2015 00:22:07 GMT: INFO (nsup): (thr_nsup.c::1226) {test} nsup start
Apr 09 2015 00:22:07 GMT: INFO (info): (thr_info.c::4614)    heartbeat_received: self 1038961 : foreign 0
Apr 09 2015 00:22:07 GMT: INFO (info): (thr_info.c::4615)    heartbeat_stats: bt 0 bf 0 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0 
Apr 09 2015 00:22:07 GMT: INFO (info): (thr_info.c::4628)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Apr 09 2015 00:22:07 GMT: INFO (info): (thr_info.c::4644) namespace test: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 100
Apr 09 2015 00:22:07 GMT: INFO (info): (thr_info.c::4653) namespace test: ldt_gc: cnt 0 io 0 gc 0 (0, 0, 0)
Apr 09 2015 00:22:07 GMT: INFO (nsup): (thr_nsup.c::1144) {test} Records: 0, 0 0-vt, 0(0) expired, 0(0) evicted, 0(0) set deletes, 0(0) set evicted. Evict ttls: 0,0,0.000. Waits: 0,0,0. Total time: 475 ms
Apr 09 2015 00:22:07 GMT: INFO (nsup): (thr_nsup.c::1226) {moo} nsup start
Apr 09 2015 00:22:07 GMT: INFO (info): (thr_info.c::4669) namespace moo: disk inuse: 37327232 memory inuse: 12916032 (bytes) sindex memory inuse: 0 (bytes) avail pct 87 cache-read pct 100.00
Apr 09 2015 00:22:07 GMT: INFO (info): (thr_info.c::4678) namespace moo: ldt_gc: cnt 4269 io 8538 gc 0 (0, 0, 0)
Apr 09 2015 00:22:07 GMT: INFO (info): (thr_info.c::4689)    partitions: actual 8192 sync 0 desync 0 zombie 0 wait 0 absent 0
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::137) histogram dump: reads (492982 total) msec
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::154)  (00: 0000324005) (01: 0000055246) (02: 0000046489) (03: 0000036999)
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::154)  (04: 0000022234) (05: 0000006522) (06: 0000001141) (07: 0000000252)
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::163)  (08: 0000000073) (09: 0000000021)
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (100085 total) msec
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::154)  (00: 0000068735) (01: 0000009666) (02: 0000008083) (03: 0000007522)
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::154)  (04: 0000004735) (05: 0000001165) (06: 0000000171) (07: 0000000008)
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (100015 total) msec
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::154)  (00: 0000068683) (01: 0000009659) (02: 0000008076) (03: 0000007519)
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::154)  (04: 0000004734) (05: 0000001165) (06: 0000000171) (07: 0000000008)
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::137) histogram dump: udf (489881 total) msec
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::154)  (00: 0000444683) (01: 0000023109) (02: 0000011081) (03: 0000006849)
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::154)  (04: 0000003181) (05: 0000000707) (06: 0000000154) (07: 0000000068)
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::154)  (08: 0000000033) (09: 0000000009) (10: 0000000005) (11: 0000000002)
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Apr 09 2015 00:22:07 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Apr 09 2015 00:22:07 GMT: INFO (nsup): (thr_nsup.c::1144) {moo} Records: 201813, 201813 0-vt, 0(0) expired, 0(0) evicted, 0(0) set deletes, 0(0) set evicted. Evict ttls: 0,0,0.000. Waits: 0,0,0. Total time: 44 ms
Apr 09 2015 00:22:11 GMT: INFO (drv_ssd): (drv_ssd.c::2398) device /opt/aerospike/data/moo.dat: used 37327232, contig-free 1790M (1790 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 1232 (0.0/s), defrag-q 0 defrag-tot 958 (0.0/s)
Apr 09 2015 00:22:13 GMT: WARNING (as): (signal.c::160) SIGSEGV received, aborting Aerospike Community Edition build 3.5.4
Apr 09 2015 00:22:13 GMT: WARNING (as): (signal.c::160) SIGSEGV received, aborting Aerospike Community Edition build 3.5.4
Apr 09 2015 00:22:13 GMT: WARNING (as): (signal.c::160) SIGSEGV received, aborting Aerospike Community Edition build 3.5.4
Apr 09 2015 00:22:13 GMT: WARNING (as): (signal.c::160) SIGSEGV received, aborting Aerospike Community Edition build 3.5.4

#4

Sorry for the delayed response.

Mark - yes, currently UDFs only support records with 256 bins or fewer (also, no more than 100 bins may be updated in a single UDF execution). We will fix the problem of crashing when a record contains over 256 bins.
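Until the fix ships, a client-side guard against the limits above may help. A plain-Python sketch (the limit values come from this thread; the record is modeled as a dict of bins, not an Aerospike client object):

```python
# Limits reported in this thread for server 3.5.4
MAX_UDF_RECORD_BINS = 256   # UDFs only support records with <= 256 bins
MAX_UDF_UPDATED_BINS = 100  # at most 100 bins updated per UDF execution

def safe_for_udf(record, bins_to_update=1):
    """Return True if a record (modeled as a dict of bins) stays
    within the UDF limits reported in this thread."""
    return (len(record) <= MAX_UDF_RECORD_BINS
            and bins_to_update <= MAX_UDF_UPDATED_BINS)

small = {f"bin{i}": i for i in range(10)}
big = {f"bin{i}": i for i in range(300)}
assert safe_for_udf(small)
assert not safe_for_udf(big)
```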

lming - your issue sounds different, and I suspect it is related to large lists. We have made a number of improvements to LDTs (and large lists in particular) since server 3.5.4, and with the next upcoming server release it is very possible that your test will no longer crash. If you still get crashes after upgrading, please provide some example code and we will investigate.

Thanks,

-Brad