How my UDF crashed Aerospike


#1

I am new to aerospike and UDF in general. Somehow during my messing around with UDF while trying to figure out how it works, I ended up causing both of my machines in a cluster to go down. Only thing that brings them back up is a reboot of the machine.

Attempts to restart it from terminal end up in

~$ sudo service aerospike restart
[FAIL] Halt aerospike: : asd failed!

The error I got while trying to run a stream query was

scanAggregate Error:  { code: -1,
  message: 'Bad file descriptor',
  func: 'as_socket_read_forever',
  file: 'src/main/aerospike/as_socket.c',
  line: 340 }

I am using test data from the online course.

My UDF is:

local function starts_with(out,rec)
    out[rec.tweet] = tostring(rec.tweet):find('%a')

    local wordCount = 0
    for w in string.gfind(tostring(rec.tweet), "%a") do
      wordCount = wordCount + 1
    end

    local result = map()
    result['occurences'] = tostring(rec.tweet):find('%a')
    result['wordCount'] = wordCount

    local result = map()
    result[rec['tweet']] = result
  return result
end

local function reduce_stats(a,b)
    local out = map()
    out['occurences'] = (a['occurences'] or 0) + (b['occurences'] or 0)
    out['wordCount'] = (a['wordCount'] or 0) + (b['wordCount'] or 0)
    return out
end

function find(stream,chars)
	local myfilter = chars_filter(chars)
	return stream : aggregate(map(),starts_with) : reduce(reduce_stats)
end

What should I know before going with UDF?
#2

Also I am aware of issues with the UDF, just pointing this out so that perhaps this sort of a mistake doesn’t bring down the entire server when another developer uploads something that doesn’t make sense.


#3

Denisbetsi,

Thanks

Can you please share last 100 lines of the server log as well …

  • R

#4
$ tail -n 100 /var/log/aerospike/aerospike.log
Jun 12 2015 12:52:08 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (0 total) msec
Jun 12 2015 12:52:08 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Jun 12 2015 12:52:08 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (0 total) msec
Jun 12 2015 12:52:08 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Jun 12 2015 12:52:08 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Jun 12 2015 12:52:08 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Jun 12 2015 12:52:18 GMT: INFO (info): (thr_info.c::4796)  system memory: free 3660328kb ( 96 percent free ) 
Jun 12 2015 12:52:18 GMT: INFO (info): (thr_info.c::4804)  migrates in progress ( 0 , 0 ) ::: ClusterSize 2 ::: objects 75443 ::: sub_objects 0
Jun 12 2015 12:52:18 GMT: INFO (info): (thr_info.c::4812)  rec refs 75443 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Jun 12 2015 12:52:18 GMT: INFO (info): (thr_info.c::4818)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0 
Jun 12 2015 12:52:18 GMT: INFO (info): (thr_info.c::4828)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (0, 686, 686) : hb (2, 2, 0) : fab (30, 30, 0)
Jun 12 2015 12:52:18 GMT: INFO (info): (thr_info.c::4830)    heartbeat_received: self 0 : foreign 753276
Jun 12 2015 12:52:18 GMT: INFO (info): (thr_info.c::4831)    heartbeat_stats: bt 0 bf 724047 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0 
Jun 12 2015 12:52:18 GMT: INFO (info): (thr_info.c::4844)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Jun 12 2015 12:52:18 GMT: INFO (info): (thr_info.c::4860) namespace test: disk inuse: 28722560 memory inuse: 16829514 (bytes) sindex memory inuse: 7021037 (bytes) avail pct 99
Jun 12 2015 12:52:18 GMT: INFO (info): (thr_info.c::4869) namespace test: ldt_gc: cnt 0 io 0 gc 0 (0, 0, 0)
Jun 12 2015 12:52:18 GMT: INFO (info): (thr_info.c::4905)    partitions: actual 2073 sync 2023 desync 0 zombie 0 wait 0 absent 0
Jun 12 2015 12:52:18 GMT: INFO (info): (hist.c::137) histogram dump: reads (0 total) msec
Jun 12 2015 12:52:18 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (0 total) msec
Jun 12 2015 12:52:18 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Jun 12 2015 12:52:18 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (0 total) msec
Jun 12 2015 12:52:18 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Jun 12 2015 12:52:18 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Jun 12 2015 12:52:18 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Jun 12 2015 12:52:24 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /opt/aerospike/data/test.dat: used 28722560, contig-free 16354M (16354 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 0 (0.0/s), defrag-q 0 defrag-tot 0 (0.0/s)
Jun 12 2015 12:52:28 GMT: INFO (info): (thr_info.c::4796)  system memory: free 3660328kb ( 96 percent free ) 
Jun 12 2015 12:52:28 GMT: INFO (info): (thr_info.c::4804)  migrates in progress ( 0 , 0 ) ::: ClusterSize 2 ::: objects 75443 ::: sub_objects 0
Jun 12 2015 12:52:28 GMT: INFO (info): (thr_info.c::4812)  rec refs 75443 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Jun 12 2015 12:52:28 GMT: INFO (info): (thr_info.c::4818)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0 
Jun 12 2015 12:52:28 GMT: INFO (info): (thr_info.c::4828)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (0, 686, 686) : hb (2, 2, 0) : fab (30, 30, 0)
Jun 12 2015 12:52:28 GMT: INFO (info): (thr_info.c::4830)    heartbeat_received: self 0 : foreign 753408
Jun 12 2015 12:52:28 GMT: INFO (info): (thr_info.c::4831)    heartbeat_stats: bt 0 bf 724173 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0 
Jun 12 2015 12:52:28 GMT: INFO (info): (thr_info.c::4844)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Jun 12 2015 12:52:28 GMT: INFO (info): (thr_info.c::4860) namespace test: disk inuse: 28722560 memory inuse: 16829514 (bytes) sindex memory inuse: 7021037 (bytes) avail pct 99
Jun 12 2015 12:52:28 GMT: INFO (info): (thr_info.c::4869) namespace test: ldt_gc: cnt 0 io 0 gc 0 (0, 0, 0)
Jun 12 2015 12:52:28 GMT: INFO (info): (thr_info.c::4905)    partitions: actual 2073 sync 2023 desync 0 zombie 0 wait 0 absent 0
Jun 12 2015 12:52:28 GMT: INFO (info): (hist.c::137) histogram dump: reads (0 total) msec
Jun 12 2015 12:52:28 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (0 total) msec
Jun 12 2015 12:52:28 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Jun 12 2015 12:52:28 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (0 total) msec
Jun 12 2015 12:52:28 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Jun 12 2015 12:52:28 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Jun 12 2015 12:52:28 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Jun 12 2015 12:52:38 GMT: INFO (info): (thr_info.c::4796)  system memory: free 3660320kb ( 96 percent free ) 
Jun 12 2015 12:52:38 GMT: INFO (info): (thr_info.c::4804)  migrates in progress ( 0 , 0 ) ::: ClusterSize 2 ::: objects 75443 ::: sub_objects 0
Jun 12 2015 12:52:38 GMT: INFO (info): (thr_info.c::4812)  rec refs 75443 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Jun 12 2015 12:52:38 GMT: INFO (info): (thr_info.c::4818)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0 
Jun 12 2015 12:52:38 GMT: INFO (info): (thr_info.c::4828)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (0, 686, 686) : hb (2, 2, 0) : fab (30, 30, 0)
Jun 12 2015 12:52:38 GMT: INFO (info): (thr_info.c::4830)    heartbeat_received: self 0 : foreign 753542
Jun 12 2015 12:52:38 GMT: INFO (info): (thr_info.c::4831)    heartbeat_stats: bt 0 bf 724303 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0 
Jun 12 2015 12:52:38 GMT: INFO (info): (thr_info.c::4844)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Jun 12 2015 12:52:38 GMT: INFO (info): (thr_info.c::4860) namespace test: disk inuse: 28722560 memory inuse: 16829514 (bytes) sindex memory inuse: 7021037 (bytes) avail pct 99
Jun 12 2015 12:52:38 GMT: INFO (info): (thr_info.c::4869) namespace test: ldt_gc: cnt 0 io 0 gc 0 (0, 0, 0)
Jun 12 2015 12:52:38 GMT: INFO (info): (thr_info.c::4905)    partitions: actual 2073 sync 2023 desync 0 zombie 0 wait 0 absent 0
Jun 12 2015 12:52:38 GMT: INFO (info): (hist.c::137) histogram dump: reads (0 total) msec
Jun 12 2015 12:52:38 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (0 total) msec
Jun 12 2015 12:52:38 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Jun 12 2015 12:52:38 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (0 total) msec
Jun 12 2015 12:52:38 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Jun 12 2015 12:52:38 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Jun 12 2015 12:52:38 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Jun 12 2015 12:52:40 GMT: INFO (nsup): (thr_nsup.c::1319) {test} nsup start
Jun 12 2015 12:52:40 GMT: INFO (nsup): (thr_nsup.c::1237) {test} Records: 38222, 0 0-vt, 0(0) expired, 0(0) evicted, 0(0) set deletes, 0(0) set evicted. Evict ttls: 0,0,0.000. Waits: 0,0,0. Total time: 10 ms
Jun 12 2015 12:52:44 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /opt/aerospike/data/test.dat: used 28722560, contig-free 16354M (16354 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 0 (0.0/s), defrag-q 0 defrag-tot 0 (0.0/s)
Jun 12 2015 12:52:48 GMT: INFO (info): (thr_info.c::4796)  system memory: free 3660320kb ( 96 percent free ) 
Jun 12 2015 12:52:48 GMT: INFO (info): (thr_info.c::4804)  migrates in progress ( 0 , 0 ) ::: ClusterSize 2 ::: objects 75443 ::: sub_objects 0
Jun 12 2015 12:52:48 GMT: INFO (info): (thr_info.c::4812)  rec refs 75443 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Jun 12 2015 12:52:48 GMT: INFO (info): (thr_info.c::4818)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0 
Jun 12 2015 12:52:48 GMT: INFO (info): (thr_info.c::4828)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (0, 686, 686) : hb (2, 2, 0) : fab (30, 30, 0)
Jun 12 2015 12:52:48 GMT: INFO (info): (thr_info.c::4830)    heartbeat_received: self 0 : foreign 753674
Jun 12 2015 12:52:48 GMT: INFO (info): (thr_info.c::4831)    heartbeat_stats: bt 0 bf 724435 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0 
Jun 12 2015 12:52:48 GMT: INFO (info): (thr_info.c::4844)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Jun 12 2015 12:52:48 GMT: INFO (info): (thr_info.c::4860) namespace test: disk inuse: 28722560 memory inuse: 16829514 (bytes) sindex memory inuse: 7021037 (bytes) avail pct 99
Jun 12 2015 12:52:48 GMT: INFO (info): (thr_info.c::4869) namespace test: ldt_gc: cnt 0 io 0 gc 0 (0, 0, 0)
Jun 12 2015 12:52:48 GMT: INFO (info): (thr_info.c::4905)    partitions: actual 2073 sync 2023 desync 0 zombie 0 wait 0 absent 0
Jun 12 2015 12:52:48 GMT: INFO (info): (hist.c::137) histogram dump: reads (0 total) msec
Jun 12 2015 12:52:48 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (0 total) msec
Jun 12 2015 12:52:48 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Jun 12 2015 12:52:48 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (0 total) msec
Jun 12 2015 12:52:48 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Jun 12 2015 12:52:48 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Jun 12 2015 12:52:48 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Jun 12 2015 12:52:58 GMT: INFO (info): (thr_info.c::4796)  system memory: free 3660340kb ( 96 percent free ) 
Jun 12 2015 12:52:58 GMT: INFO (info): (thr_info.c::4804)  migrates in progress ( 0 , 0 ) ::: ClusterSize 2 ::: objects 75443 ::: sub_objects 0
Jun 12 2015 12:52:58 GMT: INFO (info): (thr_info.c::4812)  rec refs 75443 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Jun 12 2015 12:52:58 GMT: INFO (info): (thr_info.c::4818)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0 
Jun 12 2015 12:52:58 GMT: INFO (info): (thr_info.c::4828)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (0, 686, 686) : hb (2, 2, 0) : fab (30, 30, 0)
Jun 12 2015 12:52:58 GMT: INFO (info): (thr_info.c::4830)    heartbeat_received: self 0 : foreign 753808
Jun 12 2015 12:52:58 GMT: INFO (info): (thr_info.c::4831)    heartbeat_stats: bt 0 bf 724565 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0 
Jun 12 2015 12:52:58 GMT: INFO (info): (thr_info.c::4844)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Jun 12 2015 12:52:58 GMT: INFO (info): (thr_info.c::4860) namespace test: disk inuse: 28722560 memory inuse: 16829514 (bytes) sindex memory inuse: 7021037 (bytes) avail pct 99
Jun 12 2015 12:52:58 GMT: INFO (info): (thr_info.c::4869) namespace test: ldt_gc: cnt 0 io 0 gc 0 (0, 0, 0)
Jun 12 2015 12:52:58 GMT: INFO (info): (thr_info.c::4905)    partitions: actual 2073 sync 2023 desync 0 zombie 0 wait 0 absent 0
Jun 12 2015 12:52:58 GMT: INFO (info): (hist.c::137) histogram dump: reads (0 total) msec
Jun 12 2015 12:52:58 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (0 total) msec
Jun 12 2015 12:52:58 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Jun 12 2015 12:52:58 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (0 total) msec
Jun 12 2015 12:52:58 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Jun 12 2015 12:52:58 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Jun 12 2015 12:52:58 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count

#5

Denisbetsi,

Is this the log immediately after the crash ?? I do not see any backtrace/stacktrace.

Server process simply vanishing without any stacktrace generally means it was killed by OOM (Out of Memory) killer. Can you check in

dmesg

output if asd got killed by OOM killer.

– R


#6

I think the log was erased. I didn’t think about capturing it after the crash. :frowning:


#7

Btw scan generally runs slow the bad file descriptor could be driven by the fact that transaction had very low timeout value and it got timedout.

To know what went wrong on server side I will atleast need what server was indicated as cause of crash … Please do share it next time you see the problem.

Only thing that bring back is reboot of machine …

Where are you runnig this amazon / bare metal ??


#8

This was running on Google Cloud