Critical SIGABRT error

container
docker

#1

Hi again, our two node cluster just died (both nodes simultaneously!) with a SIGABRT. Both are running in Docker Container with Debian Squeeze as host. Aerospike Enterprise Edition build 3.5.12.

Here is the log:

Jun 19 2015 23:15:19 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /opt/aerospike/data/maps.dat: used 25088, contig-free 10233M (10233 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 5 (0.0/s), defrag-q 0 defrag-tot 0 (0.0/s)
Jun 19 2015 23:15:19 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /opt/aerospike/data/user.dat: used 17408, contig-free 10238M (10238 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 0 (0.0/s), defrag-q 0 defrag-tot 1 (0.0/s)
Jun 19 2015 23:15:21 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /opt/aerospike/data/product.dat: used 105384866432, contig-free 279329M (279329 wblocks), swb-free 16, n-w 0, w-q 0 w-tot 153244 (0.5/s), defrag-q 0 defrag-tot 36681 (0.2/s)
Jun 19 2015 23:15:22 GMT: INFO (info): (thr_info.c::4796)  system memory: free 58408060kb ( 88 percent free ) 
Jun 19 2015 23:15:22 GMT: INFO (info): (thr_info.c::4804)  migrates in progress ( 0 , 0 ) ::: ClusterSize 2 ::: objects 117355580 ::: sub_objects 0
Jun 19 2015 23:15:22 GMT: INFO (info): (thr_info.c::4812)  rec refs 117355580 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Jun 19 2015 23:15:22 GMT: INFO (info): (thr_info.c::4818)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0 
Jun 19 2015 23:15:22 GMT: INFO (info): (thr_info.c::4828)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (84, 8939, 8855) : hb (4, 4, 0) : fab (30, 30, 0)
Jun 19 2015 23:15:22 GMT: INFO (info): (thr_info.c::4830)    heartbeat_received: self 213025 : foreign 213043
Jun 19 2015 23:15:22 GMT: INFO (info): (thr_info.c::4831)    heartbeat_stats: bt 0 bf 176980 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0 
Jun 19 2015 23:15:22 GMT: INFO (info): (thr_info.c::4844)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Jun 19 2015 23:15:22 GMT: INFO (info): (thr_info.c::4885) namespace product: disk inuse: 105384866432 memory inuse: 7510751680 (bytes) sindex memory inuse: 0 (bytes) avail pct 69 cache-read pct 1.86
Jun 19 2015 23:15:22 GMT: INFO (info): (thr_info.c::4885) namespace maps: disk inuse: 25088 memory inuse: 2816 (bytes) sindex memory inuse: 0 (bytes) avail pct 99 cache-read pct 0.00
Jun 19 2015 23:15:22 GMT: INFO (info): (thr_info.c::4860) namespace user: disk inuse: 17408 memory inuse: 12470 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
Jun 19 2015 23:15:22 GMT: INFO (info): (thr_info.c::4905)    partitions: actual 6120 sync 6168 desync 0 zombie 0 wait 0 absent 0
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::137) histogram dump: reads (9313153 total) msec
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::154)  (00: 0009080137) (01: 0000069890) (02: 0000068304) (03: 0000066911)
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::154)  (04: 0000025265) (05: 0000001276) (06: 0000001019) (07: 0000000341)
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::163)  (08: 0000000010)
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (312554 total) msec
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::154)  (00: 0000053848) (01: 0000043211) (02: 0000052796) (03: 0000079480)
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::154)  (04: 0000075319) (05: 0000007479) (06: 0000000369) (07: 0000000047)
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::163)  (08: 0000000005)
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (4616575 total) msec
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::154)  (00: 0004354682) (01: 0000045140) (02: 0000053622) (03: 0000079821)
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::154)  (04: 0000075381) (05: 0000007495) (06: 0000000382) (07: 0000000047)
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::163)  (08: 0000000005)
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Jun 19 2015 23:15:22 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Jun 19 2015 23:15:22 GMT: CRITICAL (demarshal): (thr_demarshal.c:thr_demarshal:336) accept: Software caused connection abort (errno 103)
Jun 19 2015 23:15:22 GMT: WARNING (as): (signal.c::93) SIGABRT received, aborting Aerospike Enterprise Edition build 3.5.12
Jun 19 2015 23:15:22 GMT: WARNING (as): (signal.c::95) stacktrace: found 8 frames
Jun 19 2015 23:15:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_abort+0x54) [0x489e03]
Jun 19 2015 23:15:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 1: /lib/x86_64-linux-gnu/libc.so.6(+0x321e0) [0x7f32257f81e0]
Jun 19 2015 23:15:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 2: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f32257f8165]
Jun 19 2015 23:15:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 3: /lib/x86_64-linux-gnu/libc.so.6(abort+0x180) [0x7f32257fb3e0]
Jun 19 2015 23:15:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 4: /usr/bin/asd(cf_fault_event+0x229) [0x51da3b]
Jun 19 2015 23:15:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 5: /usr/bin/asd(thr_demarshal+0x5ef) [0x49f2ea]
Jun 19 2015 23:15:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 6: /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f3226612b50]
Jun 19 2015 23:15:22 GMT: WARNING (as): (signal.c::95) stacktrace: frame 7: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f32258a195d]

Any idea what might have caused this? Let us know if you need any more information.


#2

Looks like errno 103 some sort of socket error.

More info on ECONNABORTED error code 103:

http://stackoverflow.com/questions/9646550/what-does-econnaborted-mean-when-trying-to-connect-a-socket

Was there any network disturbance on the host machine?


#3

Thanks lucien. We are unable to verify if there was a short network disturbance (nothing in any logs/monitoring). In any case, why should such an event cause the Aerospike Cluster to crash?


#4

We’ll need to look deeper into the root cause. Could you provide your dockerfile and let us which kernel and OS docker is running on? Also is this running on multicast?


#5

Sure! Is there any way to send you this information more privately?


#6

Done. Please check your email!


#7

Looks like you are using Debian Squeeze (Debian 6) as per your initial post.

docker.io sites recommend the following for running Docker on Debian:

Docker is supported on the following versions of Debian:

Debian 8.0 Jessie (64-bit) Debian 7.7 Wheezy (64-bit)

Docker requires Kernel 3.8+, while Wheezy ships with Kernel 3.2 (for more details on why 3.8 is required, see discussion on bug Docker bug#407.)

Would it be possible for you to upgrade the OS?