Aerospike server crashes when an AGGREGATE matches more than 7 records


#1

[Environment]

C client version (Ubuntu 12.04): 3.0.94.ubuntu12.04.x86_64

Server version (CentOS 6): community-3.4.1-1.el6.x86_64

When I run an aggregation in aql or from my C application:


If the stream contains fewer than 7 records, the query succeeds.

aql> AGGREGATE dsp_freq.getSolution('JSON{}') ON camel.freq_s WHERE cookie = 'cookie_freq_001'

+-----------------------------------------------+
| getSolution                                   |
+-----------------------------------------------+
| ["10000", "10001", "10002", "10003", "10004"] |
+-----------------------------------------------+
1 row in set (0.253 secs)



If the stream contains 7 or more records (for example 7 or 10), the server crashes.

aql> AGGREGATE dsp_freq.getSolution('JSON{}') ON camel.freq_s WHERE cookie = 'cookie_freq_001'

Error: (-1) AEROSPIKE_ERR_CLIENT

The log (aerospike.log) shows:

Mar 06 2015 06:20:38 GMT: WARNING (as): (signal.c::127) SIGSEGV received, aborting Aerospike Community Edition build 3.4.1


dsp_freq.getSolution just returns one bin value from every record in the stream.
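
As a rough illustration of that data flow (filter, then map, then reduce), here is a plain-Lua stand-in that runs outside Aerospike. It uses ordinary tables instead of Aerospike's stream, list and record types, and `run_pipeline`, `accept_all`, `to_list` and `concat` are hypothetical names chosen for this sketch. It is not how the server executes a stream UDF.

```lua
-- Plain-Lua stand-in for the stream pipeline in getSolution.
-- Records are ordinary tables here; Aerospike's stream, list and record
-- types are NOT used, so this only illustrates the data flow.

local function accept_all(rec)
    return true                      -- accepts every record
end

local function to_list(rec)
    return { rec.sid }               -- one-element list holding the 'sid' bin
end

local function concat(l1, l2)
    for _, v in ipairs(l2) do        -- append every element of l2 to l1
        l1[#l1 + 1] = v
    end
    return l1
end

-- Hypothetical helper: applies filter, map and reduce to an array of records.
local function run_pipeline(records, filter, map, reduce)
    local acc = nil
    for _, rec in ipairs(records) do
        if filter(rec) then
            local mapped = map(rec)
            if acc == nil then
                acc = mapped         -- first mapped value seeds the result
            else
                acc = reduce(acc, mapped)
            end
        end
    end
    return acc
end

-- Ten sample records, shaped like the ones in this report.
local records = {}
for i = 0, 9 do
    records[#records + 1] = { sid = tostring(10000 + i), cookie = "cookie_freq_001" }
end

local result = run_pipeline(records, accept_all, to_list, concat)
print(table.concat(result, ", "))
```

With ten input records the result is a single list of ten `sid` strings, which is the shape of the value aql prints in the successful case above.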


My server config is the default. Why does this happen?


#2

The server should dump the entire stack trace to the log. Can you share it?

– R


#3

Mar 06 2015 08:22:31 GMT: INFO (scan): (thr_tscan.c::856) scan job received

Mar 06 2015 08:22:31 GMT: INFO (scan): (thr_tscan.c::907) scan_option 0x8 0x64

Mar 06 2015 08:22:31 GMT: INFO (scan): (thr_tscan.c::964) NO bins specified select all

Mar 06 2015 08:22:31 GMT: INFO (scan): (thr_tscan.c::998) scan option: Fail if cluster change True

Mar 06 2015 08:22:31 GMT: INFO (scan): (thr_tscan.c::999) scan option: Background Job False

Mar 06 2015 08:22:31 GMT: INFO (scan): (thr_tscan.c::1000) scan option: priority is 0 n_threads 3 job_type 1

Mar 06 2015 08:22:31 GMT: INFO (scan): (thr_tscan.c::1001) scan option: scan_pct is 100

Mar 06 2015 08:22:31 GMT: INFO (scan): (thr_tscan.c::384) SCAN JOB DONE [id =1: ns= camel set=freq_s scanned=10 expired=0 set_diff=10 elapsed=404 (ms)]

Mar 06 2015 08:22:31 GMT: INFO (scan): (thr_tscan.c::1460) Scan Job 1: send final message: fd 61 result 0

Mar 06 2015 08:22:34 GMT: INFO (info): (thr_info.c::4565) system memory: free 32602036kb ( 99 percent free )

Mar 06 2015 08:22:34 GMT: INFO (info): (thr_info.c::4573) migrates in progress ( 0 , 0 ) ::: ClusterSize 1 ::: objects 20 ::: sub_objects 0

Mar 06 2015 08:22:34 GMT: INFO (info): (thr_info.c::4581) rec refs 20 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0

Mar 06 2015 08:22:34 GMT: INFO (info): (thr_info.c::4587) replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0

Mar 06 2015 08:22:34 GMT: INFO (info): (thr_info.c::4597) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (3, 16, 13) : hb (0, 0, 0) : fab (16, 16, 0)

Mar 06 2015 08:22:34 GMT: INFO (info): (thr_info.c::4599) heartbeat_received: self 1234 : foreign 0

Mar 06 2015 08:22:34 GMT: INFO (info): (thr_info.c::4600) heartbeat_stats: bt 0 bf 0 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0

Mar 06 2015 08:22:34 GMT: INFO (info): (thr_info.c::4613) tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0

Mar 06 2015 08:22:34 GMT: INFO (info): (thr_info.c::4629) namespace test: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 100

Mar 06 2015 08:22:34 GMT: INFO (info): (thr_info.c::4629) namespace mytest: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 100

Mar 06 2015 08:22:34 GMT: INFO (info): (thr_info.c::4629) namespace camel: disk inuse: 0 memory inuse: 6180 (bytes) sindex memory inuse: 56726 (bytes) avail pct 100

Mar 06 2015 08:22:34 GMT: INFO (info): (thr_info.c::4674) partitions: actual 12288 sync 0 desync 0 zombie 0 wait 0 absent 0

Mar 06 2015 08:22:34 GMT: INFO (info): (hist.c::137) histogram dump: reads (0 total) msec

Mar 06 2015 08:22:34 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (0 total) msec

Mar 06 2015 08:22:34 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec

Mar 06 2015 08:22:34 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (0 total) msec

Mar 06 2015 08:22:34 GMT: INFO (info): (hist.c::137) histogram dump: udf (1440 total) msec

Mar 06 2015 08:22:34 GMT: INFO (info): (hist.c::163) (00: 0000001440)

Mar 06 2015 08:22:34 GMT: INFO (info): (hist.c::137) histogram dump: query (1 total) msec

Mar 06 2015 08:22:34 GMT: INFO (info): (hist.c::163) (01: 0000000001)

Mar 06 2015 08:22:34 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (1 total) count

Mar 06 2015 08:22:34 GMT: INFO (info): (hist.c::163) (01: 0000000001)

Mar 06 2015 08:22:38 GMT: WARNING (as): (signal.c::127) SIGSEGV received, aborting Aerospike Community Edition build 3.4.1

Mar 06 2015 08:22:38 GMT: WARNING (as): (signal.c::129) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x59) [0x46b5b4]

Mar 06 2015 08:22:38 GMT: WARNING (as): (signal.c::129) stacktrace: frame 1: /lib64/libc.so.6() [0x3f734326a0]

Mar 06 2015 08:22:38 GMT: WARNING (as): (signal.c::129) stacktrace: frame 2: [0x7f42a5d79e72]


See the last 4 lines.


#4

Could you provide a sample data set and the Lua file to help us track down the problem? Thanks in advance.


#5
  • Namespace is camel, Set is freq_s

  • Register dsp_freq.lua (below)


local function min_key_of_map(m)
    local min = nil
    for key in map.keys(m) do
        if (min == nil) or (key < min) then
            min = key
        end
    end

    return min
end


-- return value
-- 0       : success
-- -10001  : ts is invalid
-- other   : error of aerospike:update or aerospike:create

local function set_freq(topRecord,ts)

    -- check ts
    if  ts < 10000 then
        return -10001
    end


    -- compute date
    local date = os.date('*t', ts) -- make sure server' localtime is set to Asia/Shanghai
    local mon =  date.year * 100 + date.month  -- year and month
    local week = date.year * 100 + (date.yday + 6)/7 -- year and week
    local day =  mon * 100 + date.day -- year, month and day

    -- update or create record
    if aerospike:exists(topRecord)  then -- update exist record

        -- update total 
        topRecord['total'] = (topRecord['total'] or 0) + 1

        -- update month data 
        local mon_data = topRecord['mon']
        mon_data[mon] = (mon_data[mon] or 0) + 1

        -- remove oldest month data if need 
        if  (map.size(mon_data) > 2) then
            map.remove(mon_data, min_key_of_map(mon_data))
        end

        topRecord['mon'] = mon_data
        

        -- update week data 
        local week_data = topRecord['week']
        week_data[week] = (week_data[week] or 0) + 1

        -- remove oldest week data if need 
        if  (map.size(week_data) > 2) then
            map.remove(week_data, min_key_of_map(week_data))
        end

        topRecord['week'] = week_data
        

        -- update day data
        local day_data = topRecord['day']

        local l = day_data[day] or list{0,  0,0,0,0,0,0,  0,0,0,0,0,0,  0,0,0,0,0,0,  0,0,0,0,0,0} -- 25 datas (total + 24 hours )
        l[1] = l[1] + 1 -- day's total count 
        local index = date.hour + 2
        l[index] = l[index] + 1
        day_data[day] = l

        -- remove oldest day data if need 
        if  (map.size(day_data) > 14) then
            map.remove(day_data, min_key_of_map(day_data))
        end
        
        topRecord['day'] = day_data


        -- commit all updates 
        return aerospike:update(topRecord)
        
    else -- create new record
        

        -- set total
        topRecord['total'] = 1

        -- set mon data
        local mon_data = map.new(1)
        mon_data[mon] = 1
        topRecord['mon'] = mon_data

        -- set week data
        local week_data = map.new(1)
        week_data[week] = 1
        topRecord['week'] = week_data 

        -- set day data
        local day_data = map.new(1)
        local l = list{1,  0,0,0,0,0,0,  0,0,0,0,0,0,  0,0,0,0,0,0,  0,0,0,0,0,0} -- 25 datas (total + 24 hours )
        local index = date.hour + 2
        l[index] = 1
        day_data[day] = l
        topRecord['day'] = day_data

        -- commit settings
        return aerospike:create(topRecord)
    end
end

-- return value same as function set_freq(...)
function set_freq_solution(topRecord,cookie,sid,ts)

    topRecord['ts'] = ts
    topRecord['cookie'] = cookie
    topRecord['sid'] = tostring(sid)

    return set_freq(topRecord,ts)
end

local function get_filter_test(condition,id_name)

    return function (r)
        return true
    end
end

local function get_map(id_name)
    return function (rec)
        return list{rec[id_name]}
    end
end

local function my_reduce(l1,l2)
    list.concat(l1,l2)

    return l1
end
        
function getSolution(stream,condition)

    local id_name = 'sid'
    local my_filter = get_filter_test(condition,id_name)
    local my_map = get_map(id_name)
    return stream:filter(my_filter):map(my_map):reduce(my_reduce)
end
  
  • Create Index
create index s_index_cooke_s on camel.freq_s(cookie) STRING
  • Insert records
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10000,1425535188) ON camel.freq_s WHERE PK = '0'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10001,1425535188) ON camel.freq_s WHERE PK = '1'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10002,1425535188) ON camel.freq_s WHERE PK = '2'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10003,1425535188) ON camel.freq_s WHERE PK = '3'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10004,1425535188) ON camel.freq_s WHERE PK = '4'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10005,1425535188) ON camel.freq_s WHERE PK = '5'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10006,1425535188) ON camel.freq_s WHERE PK = '6'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10007,1425535188) ON camel.freq_s WHERE PK = '7'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10008,1425535188) ON camel.freq_s WHERE PK = '8'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10009,1425535188) ON camel.freq_s WHERE PK = '9'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10010,1425535188) ON camel.freq_s WHERE PK = '10'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10011,1425535188) ON camel.freq_s WHERE PK = '11'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10012,1425535188) ON camel.freq_s WHERE PK = '12'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10013,1425535188) ON camel.freq_s WHERE PK = '13'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10014,1425535188) ON camel.freq_s WHERE PK = '14'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10015,1425535188) ON camel.freq_s WHERE PK = '15'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10016,1425535188) ON camel.freq_s WHERE PK = '16'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10017,1425535188) ON camel.freq_s WHERE PK = '17'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10018,1425535188) ON camel.freq_s WHERE PK = '18'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10019,1425535188) ON camel.freq_s WHERE PK = '19'
EXECUTE dsp_freq.set_freq_solution('cookie_freq_001',10020,1425535188) ON camel.freq_s WHERE PK = '20'
  • Map-reduce (the Aerospike server crashes)
AGGREGATE dsp_freq.getSolution('JSON{}') ON camel.freq_s WHERE cookie = 'cookie_freq_001'





  • The server is in memory-only mode with default settings.
  • If the record count is small (< 7), the server does not crash.
  • If I insert the records (100 rows) with the "insert into" command instead of the UDF and then run the map-reduce, the server does not crash.

#6

Thanks; taking a deeper look.


#7

Thanks for the information. We were able to reproduce the crash. The crash is due to a Lua cache problem that has been fixed for the next server release (version 3.5.4).

In the meantime, there are two workaround options:

  1. Disable the Lua cache by adding this mod-lua section to your server config file:

mod-lua {
    cache-enabled false
}

(If you already have a mod-lua section, simply add the cache-enabled setting to it.)

  2. Separate your record and stream UDFs into different Lua files. [That is, put set_freq_solution and getSolution in separate files.] This means the two types of UDFs will not be cached together, which should avoid the problem.
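
A minimal sketch of what the stream-only file from the second option might look like, assuming it is named dsp_freq_stream.lua (a hypothetical name). The functions are copied from the dsp_freq.lua posted earlier in this thread. set_freq_solution and its helpers would stay behind in dsp_freq.lua. The `list` module and the `stream` argument are supplied by the Aerospike UDF runtime, so the functions only do useful work when called by the server.

```lua
-- dsp_freq_stream.lua (hypothetical file name): stream UDFs only.
-- The record UDF set_freq_solution and its helpers remain in dsp_freq.lua.

-- Filter that accepts every record (unchanged from the original module).
local function get_filter_test(condition, id_name)
    return function (r)
        return true
    end
end

-- Map each record to a one-element list holding the chosen bin.
local function get_map(id_name)
    return function (rec)
        return list{rec[id_name]}
    end
end

-- Reduce by appending the second list to the first.
local function my_reduce(l1, l2)
    list.concat(l1, l2)
    return l1
end

-- Stream UDF entry point, same body as in the original module.
function getSolution(stream, condition)
    local id_name = 'sid'
    local my_filter = get_filter_test(condition, id_name)
    local my_map = get_map(id_name)
    return stream:filter(my_filter):map(my_map):reduce(my_reduce)
end
```

Once both files are registered, the aggregation would be invoked through the new module name, for example AGGREGATE dsp_freq_stream.getSolution('JSON{}') ON camel.freq_s WHERE cookie = 'cookie_freq_001', while the EXECUTE statements keep using dsp_freq.set_freq_solution.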

Thanks,

-Brad


#8

Thanks.

Does option 1 have any drawbacks? Does it cost performance?


#9

Yes, it costs performance. With the cache disabled, every Lua execution creates and sets up a new Lua state and destroys it after executing the command.
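
To make that cost concrete, here is a toy model in plain Lua that counts how often an expensive setup step runs with and without reuse. It is only an illustration under stated assumptions: `new_state`, `run`, `count_setups` and the pool are hypothetical names for this sketch, and nothing here is taken from the Aerospike code base or claims to mirror how its cache is implemented.

```lua
-- Toy model of a Lua-state cache. new_state() stands in for the expensive
-- work of creating and setting up an interpreter state.

local setups = 0                     -- how many times the setup cost was paid
local pool = {}                      -- idle states kept for reuse

local function new_state()
    setups = setups + 1
    return { id = setups }
end

-- Run one command, reusing a pooled state when caching is enabled.
local function run(cache_enabled, command)
    local state
    if cache_enabled and #pool > 0 then
        state = table.remove(pool)   -- reuse an idle state
    else
        state = new_state()          -- pay the setup cost
    end
    local result = command(state)
    if cache_enabled then
        pool[#pool + 1] = state      -- hand the state back for the next call
    end                              -- otherwise the state is simply dropped
    return result
end

-- Run n sequential commands and report how many setups were needed.
local function count_setups(cache_enabled, n)
    setups, pool = 0, {}
    for _ = 1, n do
        run(cache_enabled, function (s) return s.id end)
    end
    return setups
end

print("cache on: ", count_setups(true, 100))
print("cache off:", count_setups(false, 100))
```

In this model, 100 sequential commands need one setup with reuse enabled and 100 setups without it. How large the real difference is depends on how expensive state creation is relative to the UDF itself.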

– R