Cluster crashes after running XDR info command or collectinfo


Problem Description

A cluster running Aerospike 3.8.0 or 3.8.1.2 crashes with the following stack trace:

Apr 29 2016 21:57:34 GMT: INFO (xdr): (xdr.c:4856) XDR is not running.
Apr 29 2016 21:57:34 GMT: WARNING (as): (signal.c:185) stacktrace: found 7 frames
Apr 29 2016 21:57:34 GMT: WARNING (as): (signal.c:185) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x3a) [0x4a5a16]
Apr 29 2016 21:57:34 GMT: WARNING (as): (signal.c:185) stacktrace: frame 1: /lib64/libc.so.6() [0x3b000326a0]
Apr 29 2016 21:57:34 GMT: WARNING (as): (signal.c:185) stacktrace: frame 2: /usr/bin/asd(xdr_get_dc_stats+0x2ef) [0x541b8a]
Apr 29 2016 21:57:34 GMT: WARNING (as): (signal.c:185) stacktrace: frame 3: /usr/bin/asd(info_some+0x2e8) [0x4bfa6a]
Apr 29 2016 21:57:34 GMT: WARNING (as): (signal.c:185) stacktrace: frame 4: /usr/bin/asd(thr_info_fn+0x26d) [0x4bfeb1]
Apr 29 2016 21:57:34 GMT: WARNING (as): (signal.c:185) stacktrace: frame 5: /lib64/libpthread.so.0() [0x3b00407aa1]
Apr 29 2016 21:57:34 GMT: WARNING (as): (signal.c:185) stacktrace: frame 6: /lib64/libc.so.6(clone+0x6d) [0x3b000e893d]

XDR is not running on the cluster. The following commands may have been run:

$ sudo asadm -e collectinfo

or

$ asinfo -v dc/REMOTE_DC1 -l

Explanation

This crash is caused by a bug in early releases of Aerospike 3.8 (AER-4972), which is fixed in Aerospike 3.8.2.2. The issue manifests when both of the following are true:

  • XDR stanza in aerospike.conf has enable-xdr false
  • XDR stanza also has a data center defined with valid entries (i.e. not a skeleton definition)

The following XDR stanza would produce the issue:

xdr {
        enable-xdr false 
        xdr-digestlog-path /opt/aerospike/xdr/digestlog 100G 

        datacenter REMOTE_DC1 {
                dc-node-address-port 10.0.0.100 3000
                dc-node-address-port 10.0.0.101 3000
                dc-node-address-port 10.0.0.102 3000
        }
}

If the info command listed above or collectinfo (which runs the command implicitly) is run, nodes in the cluster will crash.
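The two conditions above can be checked before running collectinfo. The following is a minimal sketch, not an official Aerospike tool: the function name xdr_crash_risk is hypothetical, the parser only understands the simple brace-and-comment layout shown in the stanza above, and it assumes that a datacenter with at least one dc-node-address-port line counts as having "valid entries".

```python
def xdr_crash_risk(conf_text: str) -> bool:
    """Return True if the xdr stanza has 'enable-xdr false' together with
    a datacenter block that lists at least one dc-node-address-port.

    Assumption: this simplified line-based parser is good enough for
    stanzas laid out like the example in this article.
    """
    depth = 0
    in_xdr = False
    in_dc = False
    xdr_disabled = False
    dc_has_nodes = False

    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        tokens = line.split()

        if line.endswith("{"):  # a block opens
            depth += 1
            if depth == 1 and tokens[0] == "xdr":
                in_xdr = True
            elif in_xdr and depth == 2 and tokens[0] == "datacenter":
                in_dc = True
            continue

        if line == "}":  # a block closes
            if depth == 2:
                in_dc = False
            if depth == 1:
                in_xdr = False
            depth -= 1
            continue

        if in_xdr and depth == 1 and tokens[:2] == ["enable-xdr", "false"]:
            xdr_disabled = True
        if in_dc and depth == 2 and tokens[0] == "dc-node-address-port":
            dc_has_nodes = True

    return xdr_disabled and dc_has_nodes


# The stanza from this article, used as sample input.
SAMPLE = """
xdr {
        enable-xdr false
        xdr-digestlog-path /opt/aerospike/xdr/digestlog 100G

        datacenter REMOTE_DC1 {
                dc-node-address-port 10.0.0.100 3000
                dc-node-address-port 10.0.0.101 3000
                dc-node-address-port 10.0.0.102 3000
        }
}
"""

if __name__ == "__main__":
    print(xdr_crash_risk(SAMPLE))
```

In practice the text would come from reading aerospike.conf on each node. A True result only means the configuration matches the pattern described here; whether the node actually crashes also depends on the server version.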

Solution

The solution to this issue is to upgrade to Aerospike 3.8.2.2 or later, where this bug is fixed. If this is not possible, either comment out the data center definitions within the XDR stanza, or enable XDR (it can still be disabled at the namespace level), so that the cluster does not crash if collectinfo runs.
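To decide whether a node needs the upgrade or the workaround, its build version can be compared against the fixed release. The sketch below is illustrative only: the helper names are hypothetical, and it assumes every 3.8 build before 3.8.2.2 is affected, whereas this article explicitly names only 3.8.0 and 3.8.1.2. The version string is assumed to be purely numeric and dot-separated, for example as reported by asinfo -v build.

```python
def parse_version(version: str) -> tuple:
    """Turn '3.8.1.2' into (3, 8, 1, 2), padding short versions with zeros."""
    parts = [int(p) for p in version.strip().split(".")]
    parts += [0] * (4 - len(parts))
    return tuple(parts[:4])


def is_affected(version: str) -> bool:
    """Assumption: all builds from 3.8.0 up to, but not including,
    3.8.2.2 carry the bug tracked as AER-4972."""
    return (3, 8, 0, 0) <= parse_version(version) < (3, 8, 2, 2)


if __name__ == "__main__":
    for v in ("3.8.0", "3.8.1.2", "3.8.2.2"):
        print(v, is_affected(v))
```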

Notes

  • AER-4972
  • Fixed in Aerospike 3.8.2.2 and higher

Keywords

XDR COLLECTINFO INFO CRASH NODE CLUSTER FAILURE SEGFAULT 3.8.1 3.8.0 3.8.1.2

Timestamp

7/21/16