SIGSEGV received

martin_ad · March 4, 2015, 7:47pm

Getting com.aerospike.client.AerospikeException$Timeout - server stops with a segment fault after some time. Log shows:

Mar 04 2015 19:19:49 GMT: DEBUG (udf): (udf_record.c:udf_record_cache_free:441) [ENTER] NumUpdates(0)
Mar 04 2015 19:19:49 GMT: DETAIL (query): (thr_query.c:query_agg_istream_read:2677) No More Nodes for this Lua Call
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::824) rc = 0
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::827) convert lua type to val
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::843) pop return value from the stack
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::1249) apply_stream: lose the context
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::730) [CACHE] found entry: generatorTask (10)
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::734) [CACHE] returning state: generatorTask (10)
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::1253) apply_stream: END
Mar 04 2015 19:19:49 GMT: DEBUG (query): (thr_query.c:as_query__agg:2877)  Apply Stream with generatorTask getTaskStatistics 0x7fd45ffbec40 0x7fd45ffbec00 0x7fd45ffbec20 ret=0
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (udf_memtracker.c:udf_memtracker_cleanup:70) 140550120142592: Memory Tracker reset
Mar 04 2015 19:19:49 GMT: DEBUG (sindex): (secondary_index.c:as_sindex__pre_op_assert:255) DML on index asset-jobRunId in 2 state with reference count 2 < 2
Mar 04 2015 19:19:49 GMT: DETAIL (query): (thr_query.c:as_query__generator_get_nextbatch:1619) start 2 end 2 @ 2 pimd found 10
Mar 04 2015 19:19:49 GMT: DETAIL (query): (thr_query.c:as_query__generator:1788) Loop=3, Selected=10, ret=0
Mar 04 2015 19:19:49 GMT: DETAIL (query): (thr_query.c:as_query__process_request:1506) Processing Request 2
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (udf_memtracker.c:udf_memtracker_setup:63) 140550120142592: Memory Tracker 0x7fd45ffbecc0 set
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::1190) apply_stream: BEGIN
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::1193) apply_stream: poll state
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::618) [CACHE] took state: generatorTask (10)
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::631) [CACHE] Miss 0 : Total 89
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::1208) apply_stream: push aerospike into the global scope
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::1213) apply_stream: push apply_stream() onto the stack
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::1217) apply_stream: push function onto the stack
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::1222) apply_stream: push scope onto the stack
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::1226) apply_stream: push istream onto the stack
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::1229) apply_stream: push ostream onto the stack
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::1233) apply_stream: push each argument onto the stack
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::790) pushargs: 0
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::1244) apply_stream: apply the function
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::802) apply
Mar 04 2015 19:19:49 GMT: DETAIL (udf): (src/main/mod_lua.c::821) call function
Mar 04 2015 19:19:49 GMT: WARNING (as): (signal.c::150) SIGSEGV received, aborting Aerospike Community Edition build 3.4.0
Mar 04 2015 19:19:49 GMT: WARNING (as): (signal.c::157) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x59) [0x467e4c]
Mar 04 2015 19:19:49 GMT: WARNING (as): (signal.c::157) stacktrace: frame 1: /lib64/libc.so.6(+0x326a0) [0x7fd59010a6a0]
Mar 04 2015 19:19:49 GMT: WARNING (as): (signal.c::157) stacktrace: frame 2: [0x7fd4825c906b]
Mar 04 2015 19:19:49 GMT: DETAIL (scan): (thr_tscan.c:scan_udf_job_manager:1276) Waiting for job queue

dave · March 5, 2015, 1:55am

Thank you for writing. Before I get started, I can see that you are using the 3.4.0 Community version. Would you confirm the O/S and version where you are running Aerospike?

I’m not yet sure if this is related, but the log appears to indicate that there was a UDF running just before the failure. Have you been able to consistently reproduce the failure? Have you been able to reproduce the failure intermittently? Is this the one time that you have seen the failure?

Thank you for your time,

-David Meister

martin_ad · March 5, 2015, 5:37pm

This behavior is not expected for server versions 3.13 and above that are on the new clustering protocol (paxos-protocol v5) which enhances the clustering algorithm by not dropping the replica partitions until it is synchronized.

Why replica record count might appear to drop suddenly during migrations

Issue: During migration replica record count appears to change when node is viewed via:

asadm -e info

Detail

When nodes are migrating partitions, the replica record count drops, sometimes severely, an extreme example is shown below.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        Node    Namespace   Avail%   Evictions    Master    Replica     Repl     Stop    Migrates       Disk    Disk     HWM          Mem     Mem    HWM      Stop   
           .            .        .           .   Objects    Objects   Factor   Writes     (tx,rx)       Used   Used%   Disk%         Used   Used%   Mem%   Writes%   
aero-10:3000   persistent       32   206131141   1.754 G    0.000          2   false    (N/E,N/E)   1.841 TB      48      75   210.635 GB      61     65        90   
aero-11:3000   persistent       37           0   1.646 G    0.000          2   false    (N/E,N/E)   1.724 TB      45      75   197.253 GB      57     65        90   
aero-12:3000   persistent       35           0   1.821 G    0.000          2   false    (N/E,N/E)   1.767 TB      46      75   202.210 GB      58     65        90

Explanation

This is expected behaviour. The objects of desynchronised replica partitions are not counted until they become synchronised. This means that if a partition is not the master or acting master and has an inbound migration, the objects in that partition will not be counted, and so we may see a sharp drop in replica record numbers. This can be particularly obvious when a specific node has a high number of outbound migrations. In that scenario the other nodes in the cluster would have a high number of scheduled inbound migrations and the replica objects would not be counted.

This is normal behaviour. The replica record count per node will stabilise once migrations are completed.

meher · March 5, 2015, 7:47pm

Thanks Martin,

Would it be possible to send us reproduction steps (lua code and how it is called)?

Thanks for your help! –meher

martin_ad · March 5, 2015, 9:41pm

here is the lua code:

--
-- Created by IntelliJ IDEA.
-- User: joe
-- Date: 10/12/14
-- Time: 3:49 PM
-- To change this template use File | Settings | File Templates.
--

function deleteAll(r)
    status = aerospike:remove(r)

    debug("deleted record")
    return
end

function deleteTasks(r, jobRunIdTarget)
    if aerospike:exists(r) then
        return 0
    end

    local jobRunId = r['jobRunId']

    if(jobRunId == nil) then
        return 0
    end

    if (jobRunId == jobRunIdTarget) then
        return aerospike:remove(r)
    end

end

local function registerOne(rec)
    return 1
end

local function add(a,b)
    return a+b
end

--PENDING, --SUCCESS,
--FAILED
function getTaskStatistics(s,state)
    local function filterState(rec)
        if (state == "" or state == nil) then
            return true
        end

        return state == rec.state
    end
    return s:filter(filterState):map(registerOne):reduce(add)
end

local function transfromTaskToStatisticUnit(rec)
    local out = map()

    debug('transformTaskToStatisticUnit')
    out['state'] = rec['state'];

    debug('state:' + out['state'])

    return out
end

and here is some sample code:

public GeneratorStatistics getStatstics(String setName, JobRun jobRun){
  int total = getTotal(setName, jobRun);

  GeneratorStatistics statistics = new GeneratorStatistics();
  statistics.setJobRun(jobRun);
  statistics.setTotal(total);

  for (GeneratorTaskState generatorTaskState : GeneratorTaskState.values()) {
    int stateCnt = getStateCount(setName, jobRun, generatorTaskState);

    StateStatistic stateStat = new StateStatistic();
    stateStat.setTotal(stateCnt);
    stateStat.setState(generatorTaskState);

    statistics.getStats().add(stateStat);
  }

  return statistics;
}

and

private int getTotal(String setName, JobRun jobRun){
  Statement stmt = createStatement(
        setName,
        Filter.equal("jobRunId", com.aerospike.client.Value.get(jobRun.getId()))
    );

  int result = 0;

  ResultSet rs = getClient().queryAggregate(null, stmt, UDF_MODULE_NAME, UDF_FUNCTION_TASK_STATISTICS);

  try{
    while(rs.next()){
      Long obj = (Long) rs.getObject();
      result = obj.intValue();
    }
  }catch(Exception e){
    error(e, "Exception thrown in getTotal method");
  } finally{
    rs.close();
  }
  return result;
}

thanks…

meher · March 6, 2015, 2:40am

Thanks! Let us try to reproduce and get back to you.

meher · March 6, 2015, 2:47am

One thing to try actually that may help our investigation:

Could you add the following in your configuration file (at the end, on its own):

mod-lua {
  cache-enabled false
}

and then restart your node(s) and try to reproduce.

Thanks!

–meher

martin_ad · March 6, 2015, 5:46pm

Yes, when the ‘cache-enabled false’ entry is added to the config file, the problem does not happen.

brad · March 7, 2015, 2:55am

Hi Martin,

The crashes are very likely due to a Lua cache corruption bug which we have fixed for the next server release. In the meantime using the ‘cache-enabled false’ setting should be an effective workaround.

Thanks,

-Brad

lming · April 9, 2015, 12:41am

I got SIGSEGV as well. Do we need to set mod-lua > cache-enabled false for current version (3.5.4 CE)?

brad · April 9, 2015, 8:59pm

Version 3.5.4 contains the cache corruption fix, so disabling the cache should not be needed – but you could certainly try it and see if it makes a difference.