Thank you for writing. Before I get started, I can see that you are using the 3.4.0 Community version. Could you confirm the OS and OS version on which you are running Aerospike?
I’m not yet sure whether it is related, but the log appears to indicate that a UDF was running just before the failure. Have you been able to reproduce the failure, either consistently or intermittently, or is this the only time you have seen it?
This behavior is not expected for server versions 3.13 and above running the new clustering protocol (paxos-protocol v5), which enhances the clustering algorithm by not dropping replica partitions until they are synchronized.
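For reference, per the 3.13 upgrade documentation, an upgraded cluster is switched to the new protocol with a one-time dynamic command along these lines (quoted here from memory, so please verify against the upgrade guide before running it):
asadm -e "asinfo -v 'set-config:context=service;paxos-protocol=v5'"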
Why replica record count might appear to drop suddenly during migrations
Issue: During migrations, the replica record count appears to change when a node is viewed via:
asadm -e info
Detail
When nodes are migrating partitions, the replica record count can drop, sometimes severely.
This is expected behaviour. Objects in desynchronised replica partitions are not counted until those partitions become synchronised. This means that if a partition is not the master or acting master and has an inbound migration pending, the objects in that partition are not counted, so a sharp drop in replica record numbers may be observed. This can be particularly obvious when a specific node has a high number of outbound migrations: the other nodes in the cluster would then have a high number of scheduled inbound migrations, and their replica objects would not be counted.
This is normal behaviour. The replica record count per node will stabilise once migrations are completed.
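To confirm that the drop coincides with migrations, you can watch the migration statistics drain to zero as the cluster rebalances, for example (assuming a reasonably recent asadm):
asadm -e "show statistics like migrate"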
function deleteAll(r)
    -- remove the record; keep the status local rather than leaking a global
    local status = aerospike:remove(r)
    debug("deleted record")
    return status
end
function deleteTasks(r, jobRunIdTarget)
    -- nothing to do if the record does not exist
    if not aerospike:exists(r) then
        return 0
    end
    local jobRunId = r['jobRunId']
    if jobRunId == nil then
        return 0
    end
    if jobRunId == jobRunIdTarget then
        return aerospike:remove(r)
    end
    return 0
end
-- stream helpers: emit a count of one per record, then sum the counts
local function registerOne(rec)
    return 1
end

local function add(a, b)
    return a + b
end
-- possible task states: PENDING, SUCCESS, FAILED
function getTaskStatistics(s, state)
    -- count matching records: filter by state (empty or nil matches all),
    -- map each record to 1, then reduce by summing
    local function filterState(rec)
        if state == "" or state == nil then
            return true
        end
        return state == rec.state
    end
    return s:filter(filterState):map(registerOne):reduce(add)
end
local function transformTaskToStatisticUnit(rec)
    local out = map()
    debug('transformTaskToStatisticUnit')
    out['state'] = rec['state']
    -- Lua concatenates strings with '..', not '+'
    debug('state:' .. tostring(out['state']))
    return out
end
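As an aside, a module like this has to be registered with the cluster before queryAggregate can call it. A minimal sketch with the Java client (the file path and server-side module name below are placeholders, not taken from the code above):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Language;
import com.aerospike.client.task.RegisterTask;

AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
// register the Lua module and wait until it is distributed to all nodes
RegisterTask task = client.register(null, "udf/tasks.lua", "tasks.lua", Language.LUA);
task.waitTillComplete();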
And here is the Java client code that invokes the UDF:
public GeneratorStatistics getStatistics(String setName, JobRun jobRun) {
    int total = getTotal(setName, jobRun);
    GeneratorStatistics statistics = new GeneratorStatistics();
    statistics.setJobRun(jobRun);
    statistics.setTotal(total);
    // collect a per-state count for every possible task state
    for (GeneratorTaskState generatorTaskState : GeneratorTaskState.values()) {
        int stateCnt = getStateCount(setName, jobRun, generatorTaskState);
        StateStatistic stateStat = new StateStatistic();
        stateStat.setTotal(stateCnt);
        stateStat.setState(generatorTaskState);
        statistics.getStats().add(stateStat);
    }
    return statistics;
}
and
private int getTotal(String setName, JobRun jobRun) {
    Statement stmt = createStatement(
            setName,
            Filter.equal("jobRunId", com.aerospike.client.Value.get(jobRun.getId()))
    );
    int result = 0;
    ResultSet rs = getClient().queryAggregate(null, stmt, UDF_MODULE_NAME, UDF_FUNCTION_TASK_STATISTICS);
    try {
        // the reduce step of the stream UDF yields a single Long count
        while (rs.next()) {
            Long obj = (Long) rs.getObject();
            result = obj.intValue();
        }
    } catch (Exception e) {
        error(e, "Exception thrown in getTotal method");
    } finally {
        rs.close();
    }
    return result;
}
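For completeness, getStateCount (referenced above but not shown) would presumably follow the same pattern, passing the state as an extra argument that reaches getTaskStatistics as its 'state' parameter. A hypothetical sketch under that assumption:

private int getStateCount(String setName, JobRun jobRun, GeneratorTaskState state) {
    Statement stmt = createStatement(
            setName,
            Filter.equal("jobRunId", com.aerospike.client.Value.get(jobRun.getId()))
    );
    int result = 0;
    // the trailing Value argument is passed through to the stream UDF
    ResultSet rs = getClient().queryAggregate(null, stmt, UDF_MODULE_NAME,
            UDF_FUNCTION_TASK_STATISTICS, com.aerospike.client.Value.get(state.name()));
    try {
        while (rs.next()) {
            result = ((Long) rs.getObject()).intValue();
        }
    } catch (Exception e) {
        error(e, "Exception thrown in getStateCount method");
    } finally {
        rs.close();
    }
    return result;
}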
The crashes are very likely due to a Lua cache corruption bug which we have fixed for the next server release. In the meantime, setting ‘cache-enabled false’ should be an effective workaround.
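If it helps, that setting lives in the mod-lua context of aerospike.conf (a restart is needed when changing it statically); roughly:
mod-lua {
    cache-enabled false
}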
Version 3.5.4 contains the cache corruption fix, so disabling the cache should not be needed, but you could certainly try it and see whether it makes a difference.