Node aborts with null file handle


#1

Node aborts with null file handle

Problem Description

In an Aerospike cluster prior to Aerospike 3.8.3 where migrations are ongoing, nodes abort with the following stack trace:


May 18 2016 22:03:00 GMT: CRITICAL (proto): (transaction.c:316) null file handle
May 18 2016 22:03:00 GMT: WARNING (as): (signal.c:96) SIGABRT received, aborting Aerospike Enterprise Edition build 3.8.2.3 os el6
May 18 2016 22:03:00 GMT: WARNING (xdr): (xdr.c:4865) Got Signal 'Aborted'.
May 18 2016 22:03:00 GMT: INFO (xdr): (xdr.c:4868) Digestinfo logged to disk=22182947, unlogged=0
May 18 2016 22:03:00 GMT: INFO (xdr): (xdr.c:4886) Flushed all log records to disk
May 18 2016 22:03:00 GMT: INFO (cf:rbuffer): (xdr.c:4897) Stats [22208393:22182947:344876:397356:123046]
May 18 2016 22:03:00 GMT: INFO (cf:rbuffer): (xdr.c:4897) Current sptr [221825:0] rptr [221829:47] | rctx [221829:47] | wptr [221829:47] | wctx [221829:47]
May 18 2016 22:03:00 GMT: INFO (cf:rbuffer): (xdr.c:4897) Max Seg = 13415065
May 18 2016 22:03:00 GMT: INFO (xdr): (xdr.c:4899) XDR is stopped.
May 18 2016 22:03:00 GMT: WARNING (as): (signal.c:100) stacktrace: found 10 frames
May 18 2016 22:03:00 GMT: WARNING (as): (signal.c:100) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_abort+0x3c) [0x4a563f]
May 18 2016 22:03:00 GMT: WARNING (as): (signal.c:100) stacktrace: frame 1: /lib64/libc.so.6(+0x35670) [0x7f39b303d670]
May 18 2016 22:03:00 GMT: WARNING (as): (signal.c:100) stacktrace: frame 2: /lib64/libc.so.6(gsignal+0x37) [0x7f39b303d5f7]
May 18 2016 22:03:00 GMT: WARNING (as): (signal.c:100) stacktrace: frame 3: /lib64/libc.so.6(abort+0x148) [0x7f39b303ece8]
May 18 2016 22:03:00 GMT: WARNING (as): (signal.c:100) stacktrace: frame 4: /usr/bin/asd(cf_fault_sink_activate_all_held+0) [0x57b888]
May 18 2016 22:03:00 GMT: WARNING (as): (signal.c:100) stacktrace: frame 5: /usr/bin/asd(as_transaction_error+0x2b8) [0x4f2b0f]
May 18 2016 22:03:00 GMT: WARNING (as): (signal.c:100) stacktrace: frame 6: /usr/bin/asd(process_transaction+0x8c) [0x4f1545]
May 18 2016 22:03:00 GMT: WARNING (as): (signal.c:100) stacktrace: frame 7: /usr/bin/asd(thr_tsvc+0x3f) [0x4f1f9e]
May 18 2016 22:03:00 GMT: WARNING (as): (signal.c:100) stacktrace: frame 8: /lib64/libpthread.so.0(+0x7dc5) [0x7f39b4210dc5]
May 18 2016 22:03:00 GMT: WARNING (as): (signal.c:100) stacktrace: frame 9: /lib64/libc.so.6(clone+0x6d) [0x7f39b30fec9d] 

The backtrace will look as follows:

Running 'info line *0x4a563f'
Line 100 of "base/signal.c" starts at address 0x4a562d <as_sig_handle_abort+42> and ends at 0x4a56cd <as_sig_handle_abort+202>.

Running 'info line *0x7f39b303d670'
No line number information available for address 0x7f39b303d670

Running 'info line *0x7f39b303d5f7'
No line number information available for address 0x7f39b303d5f7

Running 'info line *0x7f39b303ece8'
No line number information available for address 0x7f39b303ece8

Running 'info line *0x57b888'
Line 287 of "fault.c" starts at address 0x57b888 <cf_fault_sink_activate_all_held> and ends at 0x57b896 <cf_fault_sink_activate_all_held+14>.

Running 'info line *0x4f2b0f'
Line 316 of "base/transaction.c" starts at address 0x4f2af3 <as_transaction_error+668> and ends at 0x4f2b18 <as_transaction_init_iudf>.

Running 'info line *0x4f1545'
Line 421 of "base/thr_tsvc.c" starts at address 0x4f1545 <process_transaction+140> and ends at 0x4f1555 <process_transaction+156>.

Running 'info line *0x4f1f9e'
Line 438 of "base/thr_tsvc.c" starts at address 0x4f1f9e <thr_tsvc+63> and ends at 0x4f1fb2 <thr_tsvc+83>.

Running 'info line *0x7f39b4210dc5'
No line number information available for address 0x7f39b4210dc5

Running 'info line *0x7f39b30fec9d'
No line number information available for address 0x7f39b30fec9d

Explanation

This error is very rare but can occur when nodes are migrating. If the code responsible for duplicate resolution responses receives a mismatched cluster key it re-queues the transaction. The most likely cause of the mismatched cluster key would be ongoing migrations where records are moving between nodes. If the re-queue happens at same time as the transaction times out, the file descriptor will be cleared and so a null file handle would be passed to the re-queued transaction. As the issue is related to transaction timeouts, if the network is congested, the likelihood of this error appearing are increased, though it is still very rare.

Solution

Although this error is seen only seldomly, it is a bug in versions of Aerospike prior to Aerospike 3.8.3. The resolution is to upgrade to Aerospike 3.8.3 or higher.

Notes

  • This issue is AER-5026 and is resolved in Aerospike 3.8.3 and higher
  • This issue is extremely uncommon

Keywords

AER-5026 NULL FILE HANDLE SIGABRT MIGRATIONS NETWORK CONGESTION

Timestamp

5/26/2016