Post by artursocha » Mon Apr 14, 2014 8:32 am
Hi, we have experienced a sudden crash of Aerospike Community (latest version). It had been running fine for a few days with very light traffic: Reads <150 TPS, Queries <3 TPS, Writes ~600-700 TPS. By crash I mean that the ASD processes literally died.
Basic cluster specs: 2 nodes, 1 namespace (disk), ‘return on master write’ enabled
Last recorded state:
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4484) system memory: free 115569692kb ( 87 percent free )
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4491) migrates in progress ( 0 , 0 ) ::: ClusterSize 1 ::: objects 211372514
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4499) rec refs 211383126 ::: rec locks 16 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4505) replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4520) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (893, 829271, 828378) : hb 0 : fab 16
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4522) heartbeat_received: self 0 : foreign 6867746
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4535) tree_counts: nsup 1 scan 0 batch 0 dup 15 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4568) namespace tracking: disk inuse: 87084070784 memory inuse: 13527840896 (bytes) sindex memory inuse: 11236401489 (bytes) avail pct 45 cache-read pct 2.99
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4588) partitions: actual 4096 sync 0 desync 0 zombie 0 wait 0 absent 0
Apr 14 2014 15:00:07 GMT: INFO (info): (hist.c:55) histogram dump: reads (49332543 total)
Then, almost at the same moment, we lost both nodes. Stack trace from the first node:
Apr 14 2014 14:57:07 GMT: WARNING (sindex): (base/thr_query.c:499) Failed to delete qtr from query hash.
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:140) Signal SEGV received: stack trace
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x46) [0x4a83a8]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 1: /lib/x86_64-linux-gnu/libc.so.6(+0x324f0) [0x7f498de354f0]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 2: /usr/bin/asd(cf_buf_builder_reserve+0x13) [0x509add]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 3: /usr/bin/asd(as_msg_make_response_bufbuilder+0x663) [0x466ae7]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 4: /usr/bin/asd(as_query__add_response+0x89) [0x4aa4c8]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 5: /usr/bin/asd(as_query__io+0x299) [0x4aba28]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 6: /usr/bin/asd(as_query__process_ioreq+0xd1) [0x4ac247]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 7: /usr/bin/asd(as_query__worker_th+0x66) [0x4adb93]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 8: /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f498ea39b50]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 9: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f498dedf0ed]
And from the second one:
Apr 14 2014 15:00:10 GMT: WARNING (sindex): (base/thr_query.c:499) Failed to delete qtr from query hash.
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:140) Signal SEGV received: stack trace
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 0: [0x4a83a8]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 1: /lib/x86_64-linux-gnu/libc.so.6(+0x324f0) [0x7f474f1074f0]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 2: [0x53077b]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 3: [0x536128]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 4: [0x52f36b]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 5: [0x4b19cf]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 6: [0x4b7ad0]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 7: [0x4b7f6d]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 8: [0x4ba8e5]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 9: [0x47df10]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 10: [0x47ff07]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 11: [0x4825ee]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 12: [0x48e81d]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 13: /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f474fd0ab50]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 14: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f474f1afa7d]
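Since the second trace shows only raw addresses for the asd frames, here is a rough sketch of how the frames could be pulled out of the log text for later symbol lookup. The function names `parse_frames` and `unresolved_addresses` are made up for illustration, and the regular expression is based only on the line format visible in the logs above, so it may need adjusting for other builds.

```python
import re

# One stack-trace frame as it appears in the logs above, for example:
#   stacktrace: frame 2: /usr/bin/asd(cf_buf_builder_reserve+0x13) [0x509add]
# or, when no symbol information was printed:
#   stacktrace: frame 2: [0x53077b]
FRAME_RE = re.compile(
    r"stacktrace: frame (\d+): (?:(\S+?)\(([^)]*)\) )?\[(0x[0-9a-fA-F]+)\]"
)


def parse_frames(log_text):
    """Return a list of (frame_no, module, symbol, address) tuples.

    module and symbol are None for frames logged as a bare address.
    """
    return [
        (int(m.group(1)), m.group(2), m.group(3), m.group(4))
        for m in FRAME_RE.finditer(log_text)
    ]


def unresolved_addresses(frames):
    """Addresses of the frames that carry no module/symbol information."""
    return [addr for _, module, _, addr in frames if module is None]
```

The addresses returned by `unresolved_addresses` could then be looked up with something like `addr2line -f -e /usr/bin/asd <address>`, assuming a binary with symbols that matches the crashed build is available; whether that is the case here is not known from the logs.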
I would appreciate it if you could shed more light on this issue, as it seems to be very low level.
Thanks, Artur