We have a strange problem with Aerospike CE after we started using our Lua UDF for data storage. The server runs under our repeated test load for up to 1-1.5 hours without problems, then suddenly crashes with SIGSEGV. Until we started running the Lua UDF, we had no problems.
We are using basic structures in bins (MAP, LIST), not LDTs (large data types). The test load generator repeatedly stores basic user profile data. We also tried fully cleaning the SSD storage with "dd" before starting Aerospike, and before that, in a previous test, we removed all records from all sets.
We have already disabled Lua code caching in the config, as recommended.
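For context, this is the sort of mod-lua stanza that turns the cache off (a sketch, not our exact file; the two paths are the package defaults and may differ on our box):

mod-lua {
    cache-enabled false                        # disable Lua code caching
    system-path /opt/aerospike/sys/udf/lua     # default system UDF path (may differ)
    user-path /opt/aerospike/usr/udf/lua       # default user UDF path (may differ)
}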
Data is stored on a raw device (Intel SSD 3700) running for DEV on vSphere. Environment: Linux lju1082 3.13.0-46-generic #77~precise1-Ubuntu SMP Mon Mar 2 23:24:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux. Client: the latest Java Aerospike client (from Maven).
Error in aerospike.log
Apr 01 2015 15:48:06 GMT: INFO (udf): ([C]::-1) map_utils.append (imps.history, 26, 1427903286) => start
Apr 01 2015 15:48:06 GMT: INFO (udf): ([C]::-1) map_utils.append (imps.history, 26, 1427903286) => appended value to list
Apr 01 2015 15:48:06 GMT: INFO (udf): ([C]::-1) map_utils.append (imps.history, 26, 1427903286) => touched to insure update
Apr 01 2015 15:48:06 GMT: INFO (udf): ([C]::-1) map_utils.append (imps.history, 26, 1427903286) => record updated
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::160) SIGSEGV received, aborting Aerospike Community Edition build 3.5.4
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x54) [0x46bad1]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 1: /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f4d4efac4a0]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 2: /usr/bin/asd(as_particle_frombuf+0x11a) [0x4635ed]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 3: /usr/bin/asd() [0x4e5881]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 4: /usr/bin/asd(as_storage_particle_read_all_ssd+0xca) [0x4e7796]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 5: /usr/bin/asd(as_bin_get_all+0x7d) [0x44c679]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 6: /usr/bin/asd(udf_storage_record_open+0xb9) [0x4b9c3b]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 7: /usr/bin/asd(udf_rw_local+0x39e) [0x4bd5be]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 8: /usr/bin/asd() [0x4a758b]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 9: /usr/bin/asd(as_rw_start+0x298) [0x4a91c1]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 10: /usr/bin/asd(process_transaction+0xe61) [0x4b3e4b]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 11: /usr/bin/asd(thr_tsvc+0x2e) [0x4b427d]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 12: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f4d4fe72e9a]
Apr 01 2015 15:48:06 GMT: WARNING (as): (signal.c::162) stacktrace: frame 13: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f4d4f06a3fd]
Our UDF (map_utils.lua):
-- Basic data structure:
-- Record (userKey = cookieId) -> Bin("imps.history") -> Map("campaignid") -> List([epochtime, ...])
function append(r, bin, key, value)
    local prefix = "map_utils.append ("..tostring(bin)..", "..tostring(key)..", "..tostring(value)..") => "
    info(prefix.."start")
    -- create the map bin if it does not exist
    local bv = r[bin]
    if bv == nil then
        info(prefix.."new map created (new user)")
        bv = map()
    end
    -- create a list in the value of the map if it does not exist
    if bv[key] == nil then
        info(prefix.."new list (first time seen campaign)")
        bv[key] = list()
    end
    -- append the new value to the end of the list
    list.append(bv[key], value)
    info(prefix.."appended value to list")
    -- write the map back to the bin to ensure the update is picked up
    r[bin] = bv
    info(prefix.."touched to insure update")
    -- update the record if it exists, otherwise create it
    if aerospike:exists(r) then
        aerospike:update(r)
        info(prefix.."record updated")
    else
        aerospike:create(r)
        info(prefix.."record created")
    end
    -- return the list size (returning the list itself also works: return bv[key])
    return list.size(bv[key])
end
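For completeness, the test load registers and invokes this UDF from the Java client roughly as sketched below. The namespace, module, function, bin name and map key match what appears in the log; the host, set name ("profiles"), user key and client-side module path are placeholders, not our real values.

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Language;
import com.aerospike.client.Value;
import com.aerospike.client.task.RegisterTask;

public class UdfLoadSketch {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("192.168.1.82", 3000);
        try {
            // Register the UDF module once; the local path is a placeholder.
            RegisterTask task = client.register(null, "udf/map_utils.lua",
                    "map_utils.lua", Language.LUA);
            task.waitTillComplete();

            // One iteration of the load: append an impression timestamp for campaign 26.
            // "profiles" and "cookie-1234" are placeholder set/key names.
            Key key = new Key("dsp1", "profiles", "cookie-1234");
            long epoch = System.currentTimeMillis() / 1000L;
            Object listSize = client.execute(null, key, "map_utils", "append",
                    Value.get("imps.history"), Value.get(26), Value.get(epoch));
            System.out.println("history list size: " + listSize);
        } finally {
            client.close();
        }
    }
}

Each call returns the current list size from the UDF, which we only use to sanity-check the writes.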
Our config file:
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    service-threads 8
    transaction-queues 8
    transaction-threads-per-queue 8
    proto-fd-max 15000
}

logging {
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}

network {
    service {
        address any
        port 3000
    }
    heartbeat {
        mode multicast
        address 239.1.99.82
        port 9918
        interval 150
        timeout 10
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}

namespace dsp1 {
    replication-factor 1
    memory-size 1G
    default-ttl 30d # 30 days, use 0 to never expire/evict.
    storage-engine device {
        # Use the 2 lines below with actual device paths.
        device /dev/sdb
        # The 2 lines below optimize for SSD.
        scheduler-mode noop
        write-block-size 128K
        # Use the line below to store data in memory in addition to devices.
        # data-in-memory true
    }
}
Some basic info from aerospike.log, in case it could influence the problem:
Apr 01 2015 15:48:19 GMT: INFO (config): (cfg.c::3033) system file descriptor limit: 100000, proto-fd-max: 15000
Apr 01 2015 15:48:19 GMT: INFO (cf:misc): (id.c::119) Node ip: 192.168.1.82
Apr 01 2015 15:48:19 GMT: INFO (config): (cfg.c::3089) Rack Aware mode not enabled
Apr 01 2015 15:48:19 GMT: INFO (config): (cfg.c::3092) Node id bb938caa7290c00
Apr 01 2015 15:48:19 GMT: INFO (namespace): (namespace_cold.c::101) ns dsp1 beginning COLD start
Apr 01 2015 15:48:19 GMT: INFO (drv_ssd): (drv_ssd.c::3881) opened device /dev/sdb: usable size 10737418240, io-min-size 512
Apr 01 2015 15:48:19 GMT: INFO (drv_ssd): (drv_ssd.c::3741) storage: set device /dev/sdb scheduler mode to noop
Apr 01 2015 15:48:19 GMT: INFO (drv_ssd): (drv_ssd.c::1032) /dev/sdb has 81920 wblocks of size 131072
Apr 01 2015 15:48:19 GMT: INFO (drv_ssd): (drv_ssd.c::3421) device /dev/sdb: reading device to load index
Apr 01 2015 15:48:19 GMT: INFO (drv_ssd): (drv_ssd.c::3426) In TID 11786: Using arena #150 for loading data for namespace "dsp1"
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::3447) device /dev/sdb: read complete: UNIQUE 2733 (REPLACED 1242) (GEN 619) (EXPIRED 0) (MAX-TTL 0) records
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::997) ns dsp1 loading free & defrag queues
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::931) /dev/sdb init defrag profile: 0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,2,2,0,0,0,0,0,2,8,6,3,0,0,0,2,4,9,5,2,0,0,0,3,8,18,10,2,2,0,1,5,18,15,11,2
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::1021) /dev/sdb init wblock free-q 81594, defrag-q 142
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::2585) ns dsp1 starting device maintenance threads
Apr 01 2015 15:48:20 GMT: INFO (as): (as.c::410) initializing services...
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::1703) ns dsp1 starting write worker threads
Apr 01 2015 15:48:20 GMT: INFO (tsvc): (thr_tsvc.c::1000) shared queues: 8 queues with 8 threads each
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::848) ns dsp1 starting defrag threads
Apr 01 2015 15:48:20 GMT: INFO (hb): (hb.c::2442) heartbeat socket initialization
Apr 01 2015 15:48:20 GMT: INFO (hb): (hb.c::2446) initializing multicast heartbeat socket : 239.1.99.82:9918
Apr 01 2015 15:48:20 GMT: INFO (paxos): (paxos.c::3139) partitions from storage: total 4096 found 4096 lost(set) 0 lost(unset) 0
Apr 01 2015 15:48:20 GMT: INFO (config): (cluster_config.c::406) Rack Aware is disabled.
Apr 01 2015 15:48:20 GMT: INFO (partition): (cluster_config.c::368) Rack Aware is disabled.
Apr 01 2015 15:48:20 GMT: INFO (partition): (partition.c::2810) CLUSTER SIZE = 1
Apr 01 2015 15:48:20 GMT: INFO (partition): (partition.c::2849) Global state is well formed
Apr 01 2015 15:48:20 GMT: INFO (paxos): (partition.c::2503) setting replication factors: cluster size 1, paxos single replica limit 1
Apr 01 2015 15:48:20 GMT: INFO (paxos): (partition.c::2510) {dsp1} replication factor is 1
Apr 01 2015 15:48:20 GMT: INFO (paxos): (partition.c::3755) global partition state: total 4096 lost 0 unique 4096 duplicate 0
Apr 01 2015 15:48:20 GMT: INFO (paxos): (partition.c::3756) partition state after fixing lost partitions (master): total 4096 lost 0 unique 4096 duplicate 0
Apr 01 2015 15:48:20 GMT: INFO (paxos): (partition.c::3757) 0 new partition version tree paths generated
Apr 01 2015 15:48:20 GMT: INFO (partition): (partition.c::364) ALLOW MIGRATIONS
Apr 01 2015 15:48:20 GMT: INFO (paxos): (paxos.c::3143) Paxos service ignited: bb938caa7290c00
Apr 01 2015 15:48:20 GMT: INFO (scan): (thr_tscan.c::2081) started 32 threads
Apr 01 2015 15:48:20 GMT: INFO (batch): (thr_batch.c::342) Initialize 4 batch worker threads.
Apr 01 2015 15:48:20 GMT: INFO (drv_ssd): (drv_ssd.c::4316) {dsp1} floor set at 89 wblocks per device
Apr 01 2015 15:48:23 GMT: INFO (paxos): (paxos.c::3205) paxos supervisor thread started
Apr 01 2015 15:48:23 GMT: INFO (nsup): (thr_nsup.c::1196) namespace supervisor started
Apr 01 2015 15:48:23 GMT: INFO (ldt): (thr_nsup.c::1153) LDT supervisor started
Apr 01 2015 15:48:23 GMT: INFO (demarshal): (thr_demarshal.c::221) Saved original JEMalloc arena #7 for thr_demarshal()
Apr 01 2015 15:48:23 GMT: INFO (demarshal): (thr_demarshal.c::249) Service started: socket 3000