Experienced sudden crash error

Post by artursocha » Mon Apr 14, 2014 8:32 am

Hi, we have experienced a sudden crash of Aerospike Community (latest version). It was running fine for a few days with very light traffic: reads <150 TPS, queries <3 TPS, writes ~600-700 TPS. By crash I mean that the ASD processes literally died.

Basic cluster specs: 2 nodes, 1 namespace (disk), ‘return on master write’ enabled

Last recorded state:

Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4484) system memory: free 115569692kb ( 87 percent free )
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4491) migrates in progress ( 0 , 0 ) ::: ClusterSize 1 ::: objects 211372514
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4499) rec refs 211383126 ::: rec locks 16 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4505) replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4520) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (893, 829271, 828378) : hb 0 : fab 16
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4522) heartbeat_received: self 0 : foreign 6867746
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4535) tree_counts: nsup 1 scan 0 batch 0 dup 15 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4568) namespace tracking: disk inuse: 87084070784 memory inuse: 13527840896 (bytes) sindex memory inuse: 11236401489 (bytes) avail pct 45 cache-read pct 2.99
Apr 14 2014 15:00:07 GMT: INFO (info): (base/thr_info.c:4588) partitions: actual 4096 sync 0 desync 0 zombie 0 wait 0 absent 0
Apr 14 2014 15:00:07 GMT: INFO (info): (hist.c:55) histogram dump: reads (49332543 total)

and then almost at the same moment we lost 2 nodes:

Stacktrace from first node:

Apr 14 2014 14:57:07 GMT: WARNING (sindex): (base/thr_query.c:499) Failed to delete qtr from query hash.
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:140) Signal SEGV received: stack trace
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x46) [0x4a83a8]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 1: /lib/x86_64-linux-gnu/libc.so.6(+0x324f0) [0x7f498de354f0]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 2: /usr/bin/asd(cf_buf_builder_reserve+0x13) [0x509add]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 3: /usr/bin/asd(as_msg_make_response_bufbuilder+0x663) [0x466ae7]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 4: /usr/bin/asd(as_query__add_response+0x89) [0x4aa4c8]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 5: /usr/bin/asd(as_query__io+0x299) [0x4aba28]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 6: /usr/bin/asd(as_query__process_ioreq+0xd1) [0x4ac247]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 7: /usr/bin/asd(as_query__worker_th+0x66) [0x4adb93]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 8: /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f498ea39b50]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 9: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f498dedf0ed]

and from second one:

Apr 14 2014 15:00:10 GMT: WARNING (sindex): (base/thr_query.c:499) Failed to delete qtr from query hash.
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:140) Signal SEGV received: stack trace
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 0: [0x4a83a8]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 1: /lib/x86_64-linux-gnu/libc.so.6(+0x324f0) [0x7f474f1074f0]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 2: [0x53077b]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 3: [0x536128]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 4: [0x52f36b]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 5: [0x4b19cf]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 6: [0x4b7ad0]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 7: [0x4b7f6d]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 8: [0x4ba8e5]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 9: [0x47df10]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 10: [0x47ff07]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 11: [0x4825ee]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 12: [0x48e81d]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 13: /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f474fd0ab50]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 14: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f474f1afa7d]
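
For anyone comparing the two traces side by side, the frames can be pulled apart mechanically. The following is only a rough sketch in Python, assuming the log line layout quoted in this thread; the names FRAME_RE, parse_frames and SAMPLE are made up for illustration and are not part of any Aerospike tooling. It shows which frames carry a symbol name and which (as in the second node's trace) only carry a raw address.

```python
import re

# Matches one stack-trace entry in the asd log layout quoted in this thread, e.g.
#   "stacktrace: frame 2: /usr/bin/asd(cf_buf_builder_reserve+0x13) [0x509add]"
# The "module(symbol)" part is optional because some frames only show "[address]".
FRAME_RE = re.compile(
    r"stacktrace: frame (?P<index>\d+): "
    r"(?:(?P<module>[^\s(\[]+)\((?P<symbol>[^)]*)\) )?"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"
)


def parse_frames(log_text):
    """Return a list of (index, module, symbol, address) tuples.

    module and symbol are None when the log line has no symbol information.
    """
    frames = []
    for match in FRAME_RE.finditer(log_text):
        frames.append((
            int(match.group("index")),
            match.group("module"),
            match.group("symbol") or None,  # "()" with nothing inside -> None
            match.group("address"),
        ))
    return frames


# Three lines copied from the traces in this post (two from the first node,
# one from the second node).
SAMPLE = """\
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 1: /lib/x86_64-linux-gnu/libc.so.6(+0x324f0) [0x7f498de354f0]
Apr 14 2014 14:57:07 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 2: /usr/bin/asd(cf_buf_builder_reserve+0x13) [0x509add]
Apr 14 2014 15:00:10 GMT: WARNING (as): (base/signal.c:149) stacktrace: frame 2: [0x53077b]
"""

for index, module, symbol, address in parse_frames(SAMPLE):
    label = symbol if symbol else "<no symbol>"
    print(f"frame {index}: {module or '<unknown module>'} {label} @ {address}")
```

Because it uses finditer, the same helper also works when the log has been pasted as one long line.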

I would appreciate it if you could shed more light on this issue, as it seems to be very low level.

thanks, Artur

Post by young » Wed May 14, 2014 11:47 am

Can you tell us what version of the OS you are on?

Post by young » Wed May 14, 2014 12:10 pm

We took a look and realized that the problem was caused because stream UDFs were being used and writes were attempted. Currently stream UDFs cannot issue writes into the database and are intended for read-only access. We are changing the server to prevent a crash under these circumstances.

Post by devops02 » Wed May 14, 2014 12:58 pm

Hi Artur,

I was wondering if we could follow up on this issue with you. Can you give us the steps you took that caused this crash? Also, if we need a deeper conversation, we can set up a Skype session with our engineers if you're available, since we are intrigued by how this caused your server to crash.

-Jerry

Post by hqwider » Mon Jul 14, 2014 1:47 am

Hi,

We are having the same issue again. We have a two-node cluster running CentOS 6.5 and Aerospike version 3.3.5.

We can confirm that we are not using Stream UDFs at all. The following is part of the log file before the service stops:

Jul 14 2014 08:17:01 GMT: WARNING (as): (signal.c::148) Signal SEGV received: stack trace
Jul 14 2014 08:17:01 GMT: WARNING (as): (signal.c::155) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x48) [0x4662cd]
Jul 14 2014 08:17:01 GMT: WARNING (as): (signal.c::155) stacktrace: frame 1: /lib64/libc.so.6() [0x31520329a0]
Jul 14 2014 08:17:01 GMT: WARNING (as): (signal.c::155) stacktrace: frame 2: /usr/bin/asd(as_bin_create+0x96) [0x44d61d]
Jul 14 2014 08:17:01 GMT: WARNING (as): (signal.c::155) stacktrace: frame 3: /usr/bin/asd(write_local+0x1511) [0x49ee30]
Jul 14 2014 08:17:01 GMT: WARNING (as): (signal.c::155) stacktrace: frame 4: /usr/bin/asd() [0x4a39be]
Jul 14 2014 08:17:01 GMT: WARNING (as): (signal.c::155) stacktrace: frame 5: /usr/bin/asd(as_rw_start+0x264) [0x4a430d]
Jul 14 2014 08:17:01 GMT: WARNING (as): (signal.c::155) stacktrace: frame 6: /usr/bin/asd(process_transaction+0xbe8) [0x4ade5d]
Jul 14 2014 08:17:01 GMT: WARNING (as): (signal.c::155) stacktrace: frame 7: /usr/bin/asd(thr_tsvc_process_or_enqueue+0x3e) [0x4ae62f]
Jul 14 2014 08:17:01 GMT: WARNING (as): (signal.c::155) stacktrace: frame 8: /usr/bin/asd(thr_demarshal+0x35c) [0x476f27]
Jul 14 2014 08:17:01 GMT: WARNING (as): (signal.c::155) stacktrace: frame 9: /lib64/libpthread.so.0() [0x31524079d1]
Jul 14 2014 08:17:01 GMT: WARNING (as): (signal.c::155) stacktrace: frame 10: /lib64/libc.so.6(clone+0x6d) [0x31520e8b6d]
[root@audiencedb2 ~]#

Can you help?

Post by devops02 » Tue Jul 15, 2014 10:42 am

Hi,

I was wondering, are you using single-bin? Also, is your data being stored in memory only or on disk?

Jerry

Post by simonc » Tue Jul 08, 2014 4:46 am

http://www.aerospike.com/community/labs … _data.html