Scans suddenly stop responding


#1

Hi, I have 3 aerospike nodes installed on linux OS.

I am working with a Java client from a linux machine.

Aerospike Version: Enterprise Edition build 3.5.15

In a certain point of time, one of the aerospike nodes stops responding to scans.

This is a grep of “scan job received” and “SCAN JOB DONE” in the aerospike.log:

Jan 20 2016 13:26:05 GMT: INFO (scan): (thr_tscan.c::387) SCAN JOB DONE  [id =23597241221030: ns= regional set=subscribers scanned=6539 expired=0 set_diff=0 elapsed=15 (ms)]
Jan 20 2016 13:26:05 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:26:05 GMT: INFO (scan): (thr_tscan.c::387) SCAN JOB DONE  [id =23597268704158: ns= regional set=subscribers scanned=6539 expired=0 set_diff=0 elapsed=15 (ms)]
Jan 20 2016 13:26:05 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:26:05 GMT: INFO (scan): (thr_tscan.c::387) SCAN JOB DONE  [id =23597295959600: ns= regional set=subscribers scanned=6539 expired=0 set_diff=0 elapsed=16 (ms)]
Jan 20 2016 13:26:05 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:26:05 GMT: INFO (scan): (thr_tscan.c::387) SCAN JOB DONE  [id =23597323034699: ns= regional set=subscribers scanned=6539 expired=0 set_diff=0 elapsed=16 (ms)]
Jan 20 2016 13:26:05 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:26:06 GMT: INFO (scan): (thr_tscan.c::387) SCAN JOB DONE  [id =23597350354171: ns= regional set=subscribers scanned=6539 expired=0 set_diff=0 elapsed=15 (ms)]
Jan 20 2016 13:26:06 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:26:06 GMT: INFO (scan): (thr_tscan.c::387) SCAN JOB DONE  [id =23597376914686: ns= regional set=subscribers scanned=6539 expired=0 set_diff=0 elapsed=15 (ms)]
Jan 20 2016 13:26:09 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:26:09 GMT: INFO (scan): (thr_tscan.c::387) SCAN JOB DONE  [id =23604106355837: ns= global set=status scanned=9 expired=0 set_diff=126 elapsed=13 (ms)]
Jan 20 2016 13:26:14 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:26:14 GMT: INFO (scan): (thr_tscan.c::387) SCAN JOB DONE  [id =23609107672641: ns= global set=status scanned=9 expired=0 set_diff=126 elapsed=13 (ms)]
Jan 20 2016 13:26:17 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:26:18 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:26:19 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:26:37 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:26:55 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:27:06 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:27:13 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:27:23 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:27:31 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:27:40 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:27:49 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:27:55 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:27:59 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:28:07 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:28:12 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:28:16 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:28:26 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:28:29 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:28:33 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:28:44 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:28:44 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:28:48 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:29:01 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:29:02 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:29:05 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:29:18 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:29:20 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:29:22 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:29:33 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:29:34 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:29:37 GMT: INFO (scan): (thr_tscan.c::871) scan job received

This is the output of the last succesfull scan + 2 following failed scans, and the log lines around it:

Jan 20 2016 13:26:14 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:26:14 GMT: INFO (scan): (thr_tscan.c::922) scan_option 0x8 0x64
Jan 20 2016 13:26:14 GMT: INFO (scan): (thr_tscan.c::979) NO bins specified select all
Jan 20 2016 13:26:14 GMT: INFO (scan): (thr_tscan.c::1013) scan option: Fail if cluster change True
Jan 20 2016 13:26:14 GMT: INFO (scan): (thr_tscan.c::1014) scan option: Background Job False
Jan 20 2016 13:26:14 GMT: INFO (scan): (thr_tscan.c::1015) scan option: priority is 0 n_threads 1 job_type 1
Jan 20 2016 13:26:14 GMT: INFO (scan): (thr_tscan.c::1016) scan option: scan_pct is 100 
Jan 20 2016 13:26:14 GMT: INFO (scan): (thr_tscan.c::387) SCAN JOB DONE  [id =23609107672641: ns= global set=status scanned=9 expired=0 set_diff=126 elapsed=13 (ms)]
Jan 20 2016 13:26:14 GMT: INFO (scan): (thr_tscan.c::1459) Scan Job 23609107672641: send final message: fd 99 result 0
Jan 20 2016 13:26:17 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:26:17 GMT: INFO (scan): (thr_tscan.c::922) scan_option 0x0 0x64
Jan 20 2016 13:26:17 GMT: INFO (scan): (thr_tscan.c::1013) scan option: Fail if cluster change False
Jan 20 2016 13:26:17 GMT: INFO (scan): (thr_tscan.c::1014) scan option: Background Job False
Jan 20 2016 13:26:17 GMT: INFO (scan): (thr_tscan.c::1015) scan option: priority is 0 n_threads 1 job_type 1
Jan 20 2016 13:26:17 GMT: INFO (scan): (thr_tscan.c::1016) scan option: scan_pct is 100 
Jan 20 2016 13:26:18 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:26:18 GMT: INFO (scan): (thr_tscan.c::922) scan_option 0x0 0x64
Jan 20 2016 13:26:18 GMT: INFO (scan): (thr_tscan.c::1013) scan option: Fail if cluster change False
Jan 20 2016 13:26:18 GMT: INFO (scan): (thr_tscan.c::1014) scan option: Background Job False
Jan 20 2016 13:26:18 GMT: INFO (scan): (thr_tscan.c::1015) scan option: priority is 0 n_threads 1 job_type 1
Jan 20 2016 13:26:18 GMT: INFO (scan): (thr_tscan.c::1016) scan option: scan_pct is 100 
Jan 20 2016 13:26:19 GMT: INFO (scan): (thr_tscan.c::871) scan job received
Jan 20 2016 13:26:19 GMT: INFO (scan): (thr_tscan.c::922) scan_option 0x8 0x64
Jan 20 2016 13:26:19 GMT: INFO (scan): (thr_tscan.c::979) NO bins specified select all
Jan 20 2016 13:26:19 GMT: INFO (scan): (thr_tscan.c::1013) scan option: Fail if cluster change True
Jan 20 2016 13:26:19 GMT: INFO (scan): (thr_tscan.c::1014) scan option: Background Job False
Jan 20 2016 13:26:19 GMT: INFO (scan): (thr_tscan.c::1015) scan option: priority is 0 n_threads 1 job_type 1
Jan 20 2016 13:26:19 GMT: INFO (scan): (thr_tscan.c::1016) scan option: scan_pct is 100 
Jan 20 2016 13:26:20 GMT: INFO (info): (thr_info.c::4796)  system memory: free 10559952kb ( 64 percent free ) 
Jan 20 2016 13:26:20 GMT: INFO (info): (thr_info.c::4804)  migrates in progress ( 0 , 0 ) ::: ClusterSize 3 ::: objects 14155 ::: sub_objects 0
Jan 20 2016 13:26:20 GMT: INFO (info): (thr_info.c::4812)  rec refs 14155 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Jan 20 2016 13:26:20 GMT: INFO (info): (thr_info.c::4818)  replica errs :: null 2168281804 non-null 0 ::: sync copy errs :: node 0 :: master 0 
Jan 20 2016 13:26:20 GMT: INFO (info): (thr_info.c::4828)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (30, 461, 431) : hb (3, 8, 5) : fab (42, 77, 35)
Jan 20 2016 13:26:20 GMT: INFO (info): (thr_info.c::4830)    heartbeat_received: self 0 : foreign 679844
Jan 20 2016 13:26:20 GMT: INFO (info): (thr_info.c::4831)    heartbeat_stats: bt 0 bf 452716 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 5 efd 4 efa 1 um 0 mcf 0 rc 5 
Jan 20 2016 13:26:20 GMT: INFO (info): (thr_info.c::4844)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Jan 20 2016 13:26:20 GMT: INFO (info): (thr_info.c::4860) namespace global: disk inuse: 235648 memory inuse: 145538 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
Jan 20 2016 13:26:20 GMT: INFO (info): (thr_info.c::4860) namespace regional: disk inuse: 14065792 memory inuse: 10161116 (bytes) sindex memory inuse: 538232 (bytes) avail pct 99
Jan 20 2016 13:26:20 GMT: INFO (info): (thr_info.c::4860) namespace local: disk inuse: 512 memory inuse: 222 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
Jan 20 2016 13:26:20 GMT: INFO (info): (thr_info.c::4905)    partitions: actual 4053 sync 6848 desync 0 zombie 0 wait 0 absent 1387
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::137) histogram dump: reads (9392607 total) msec
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::154)  (00: 0009392428) (01: 0000000138) (02: 0000000031) (03: 0000000008)
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::163)  (04: 0000000001) (06: 0000000001)
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (2607768 total) msec
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::154)  (00: 0002602724) (01: 0000003383) (02: 0000000738) (03: 0000000557)
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::154)  (04: 0000000292) (05: 0000000060) (06: 0000000002) (07: 0000000001)
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::163)  (08: 0000000005) (09: 0000000006)
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (4203086 total) msec
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::154)  (00: 0004196092) (01: 0000004486) (02: 0000000976) (03: 0000000766)
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::154)  (04: 0000000554) (05: 0000000194) (06: 0000000006) (07: 0000000001)
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::163)  (08: 0000000005) (09: 0000000006)
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::137) histogram dump: query (2304 total) msec
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::163)  (00: 0000002304)
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (2304 total) count
Jan 20 2016 13:26:20 GMT: INFO (info): (hist.c::163)  (00: 0000000480) (01: 0000001056) (02: 0000000768)
Jan 20 2016 13:26:22 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /var/log/contextream/xl/aerospike_data/global.dat: used 235648, contig-free 498M (3988 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 188 (0.0/s), defrag-q 0 defrag-tot 182 (0.0/s)
Jan 20 2016 13:26:22 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /var/log/contextream/xl/aerospike_data/regional.dat: used 14065792, contig-free 8166M (65329 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 94629 (0.9/s), defrag-q 0 defrag-tot 94431 (0.9/s)
Jan 20 2016 13:26:22 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /var/log/contextream/xl/aerospike_data/local.dat: used 512, contig-free 498M (3991 wblocks), swb-free 1, n-w 0, w-q

This is the exception in the client log: Failed database scan- 'Error Code 11: Server not available' (17235972)

After restarting the aerospike node, it started working well again.


#2

Error Code 11 is for server not accepting requests.

If issue re-occurrs, can you confirm if port 3000 is accessible on all nodes in the cluster from the Java client when you see this error.


#3

As you can see in the aerospike.log, there server recieved the scans, but didn’t do anything with them.


#4

Yes. The server clearly was still running.

My concern is if the client could communicate to port 3000 at that time. Maybe checking it using telnet?

Did you also see the scans getting queued up as jobs?


#5

I will check this if the problem will reproduce.

We upgraded to the latest Java Client and Aerospike Server, and the problem still didn’t reproduce.