Aerospike Crash



Today one of our cluster members crashed. We're running two nodes, and one of them went offline.

We're running the latest version (Enterprise).

I have attached all the log files and configs we have. Please investigate.


We will investigate the logs and the stack traces. From a quick look, it seems the cluster was reporting cluster integrity faults about 4 minutes prior to the crash, and the node that crashed was running as a one-node cluster. Was there anything unexpected triggered on the other node?
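One way to confirm that timeline on your side is to filter the server log for integrity-fault lines. A minimal sketch - the sample lines below are illustrative, not actual output from this cluster; in practice you would pipe the real server log (path varies by install) into the grep:

```shell
#!/bin/sh
# Count integrity-fault lines in a log stream (case-insensitive).
# The two printf lines stand in for real server log output.
printf '%s\n' \
  "Mar 01 12:00:01 WARNING (paxos): CLUSTER INTEGRITY FAULT detected" \
  "Mar 01 12:00:02 INFO (info): routine tick" |
grep -ci "cluster integrity fault"
```

Running this prints `1`, the number of matching lines; correlating their timestamps with the crash time shows how long the cluster was split before the node went down.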

Starting with a recent release, we have made improvements in the Paxos algorithm implementation, and we recommend setting the configuration paxos-recovery-policy to 'auto-reset-master' if the cluster is sensitive to network blips.
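A minimal aerospike.conf sketch of that setting, assuming the 3.x service-context syntax - please check the configuration reference for your exact version:

```
service {
    # Automatically re-establish the Paxos master after transient
    # network splits (recommended for clusters sensitive to blips)
    paxos-recovery-policy auto-reset-master
}
```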

To investigate further, could you please share information on the features you are currently using - UDFs, scans, batch operations - and whether anything changed close to when the node crashed?

Are you currently deployed on AWS/similar or bare-metal?


We identified a fix for the SegV you observed; it missed the release you were using but has gone out in release 3.7.1. Please give it a spin and let us know if you run into issues.

[AER-4487], [AER-4690] - (Clustering/Migration) Race condition causing incorrect heartbeat fd saved and later not removable.


Can you indicate what type of access pattern you have? Purely puts/gets? Batches? Scans? Secondary indexes?


The other node unexpectedly started swapping memory (not caused by Aerospike).
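For anyone hitting the same symptom: sustained swapping can stall the server process long enough to trip heartbeat timeouts and split the cluster. A quick Linux-only check of swap usage (a sketch, assuming /proc/meminfo is available):

```shell
#!/bin/sh
# Report swap usage on a Linux node (values in /proc/meminfo are in kB).
swap_total=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
swap_free=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
echo "swap used: $((swap_total - swap_free)) kB of ${swap_total} kB"
```

If used swap is nonzero and growing while the database is under load, the node is likely to stall; lowering vm.swappiness or adding memory headroom is the usual remedy.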

Bare metal.

~70% reads, 30% writes. Only a few batch gets and scans per hour. No secondary indexes.

Will do!