Aerospike Crash


#1

Hello,

Today one of our cluster members crashed. We're running two nodes, and one of them went offline.

We're running the latest version (3.7.0.2 enterprise).

I have attached all the log files and configs we have. Please investigate.

https://dl.dropboxusercontent.com/u/5982366/logs.rar


#2

We will investigate the logs and the stack traces. From a quick look, it seems that the cluster was showing cluster integrity faults about 4 minutes prior to the crash, and that the node that crashed was running as a 1-node cluster. Did anything unexpected happen on the other node?

Starting with 3.7.0.2, we have made improvements to the Paxos algorithm implementation, and we recommend setting the paxos-recovery-policy configuration to 'auto-reset-master' if the cluster is sensitive to network blips.

http://www.aerospike.com/docs/reference/configuration/#paxos-recovery-policy
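As a minimal sketch of the recommendation above, the setting could look like this in aerospike.conf. This assumes the parameter belongs in the service context, as the linked configuration reference describes for the 3.x releases; please verify it against the reference for your exact version before applying it.

```
service {
    # Keep your existing service settings as they are; only this line is new.
    # Assumption: paxos-recovery-policy is set in the service context.
    paxos-recovery-policy auto-reset-master
}
```

A change in the configuration file takes effect after the node is restarted, and it should be applied to every node in the cluster.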

In order to investigate further, could you please share which features you are currently using (UDFs, scans, batch operations) and whether anything changed close to when the node crashed?

Are you currently deployed on AWS (or a similar cloud) or on bare metal?


#3

We identified a fix for the SegV you observed. It missed the release you were running and has since gone out in release 3.7.1. Please give it a spin and let us know if you run into issues.

[AER-4487], [AER-4690] - (Clustering/Migration) Race condition causing incorrect heartbeat fd saved and later not removable.


#4

Can you indicate what type of access pattern you have? Purely puts/gets? Batches? Scans? Secondary indexes?


#5

It unexpectedly started swapping memory (not caused by Aerospike).

Bare metal.

~70% reads, 30% writes. Only a few batch gets and scans per hour. No secondary indexes.

Will do!