One node cluster visibility false (a node became invisible to other nodes over the network)


#1

We have been running a 5-node cluster for many days. Yesterday, for some reason, node 192.168.3.10 seems to have become invisible to the other nodes, and the others started migrating data heavily. How can we fix it?

Here is the log from 192.168.3.11:

Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::154)  (08: 0000001659) (09: 0000000863) (10: 0000000494) (11: 0000000023)
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (997416000 total) msec
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::154)  (00: 0986079830) (01: 0005451991) (02: 0003527745) (03: 0001803147)
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::154)  (04: 0000494056) (05: 0000044663) (06: 0000004595) (07: 0000003320)
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::154)  (08: 0000002151) (09: 0000001677) (10: 0000001505) (11: 0000001281)
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::163)  (12: 0000000039)
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::137) histogram dump: proxy (31776 total) msec
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::154)  (00: 0000027373) (01: 0000001692) (02: 0000000944) (03: 0000000764)
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::154)  (04: 0000000556) (05: 0000000221) (06: 0000000048) (07: 0000000036)
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::154)  (08: 0000000067) (09: 0000000051) (10: 0000000022) (11: 0000000002)
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (997406653 total) msec
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::154)  (00: 0986081074) (01: 0005447987) (02: 0003525934) (03: 0001802350)
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::154)  (04: 0000493675) (05: 0000044311) (06: 0000004024) (07: 0000002185)
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::154)  (08: 0000001776) (09: 0000001677) (10: 0000001305) (11: 0000000325)
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::163)  (12: 0000000030)
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Nov 05 2015 03:49:40 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Nov 05 2015 03:49:44 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb9dd890bbf926c and self bb91d8c0bbf926c
Nov 05 2015 03:49:44 GMT: INFO (paxos): (paxos.c::2272) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb9dd890bbf926c,bb99d7e0bbf926c,bb9498c0bbf926c,bb91d8c0bbf926c,bb9198c0bbf926c
Nov 05 2015 03:49:47 GMT: INFO (hb): (hb.c::2297) HB node bb9dd890bbf926c in different cluster - succession lists don't match
Nov 05 2015 03:49:47 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /dev/sdc1: used 132140629376, contig-free 229221M (1833774 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 1087151 (0.1/s), defrag-q 0 defrag-tot 30626 (0.0/s)
Nov 05 2015 03:49:48 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /dev/sdb1: used 132133732480, contig-free 146338M (1170706 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 47527497 (2.2/s), defrag-q 0 defrag-tot 45800717 (2.2/s)
Nov 05 2015 03:49:49 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb9dd890bbf926c and self bb91d8c0bbf926c
Nov 05 2015 03:49:49 GMT: INFO (paxos): (paxos.c::2272) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb9dd890bbf926c,bb99d7e0bbf926c,bb9498c0bbf926c,bb91d8c0bbf926c,bb9198c0bbf926c
Nov 05 2015 03:49:50 GMT: INFO (info): (thr_info.c::4488)  system memory: free 61213396kb ( 46 percent free ) 
Nov 05 2015 03:49:50 GMT: INFO (info): (thr_info.c::4495)  migrates in progress ( 683 , 0 ) ::: ClusterSize 4 ::: objects 913652595
Nov 05 2015 03:49:50 GMT: INFO (info): (thr_info.c::4503)  rec refs 913865140 ::: rec locks 1 ::: trees 0 ::: wr reqs 0 ::: mig tx 683 ::: mig rx 1
Nov 05 2015 03:49:50 GMT: INFO (info): (thr_info.c::4509)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 441657577 :: master 0 
Nov 05 2015 03:49:50 GMT: INFO (info): (thr_info.c::4519)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (693, 4285200, 4284507) : hb (5, 22, 17) : fab (72, 272, 200)
Nov 05 2015 03:49:50 GMT: INFO (info): (thr_info.c::4521)    heartbeat_received: self 0 : foreign 1037539361
Nov 05 2015 03:49:50 GMT: INFO (info): (thr_info.c::4522)    heartbeat_stats: bt 0 bf 414116139 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 22 efd 17 efa 5 um 0 mcf 0 rc 17 
Nov 05 2015 03:49:50 GMT: INFO (info): (thr_info.c::4535)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 1 migtx 683 ssdr 0 ssdw 0 rw 2
Nov 05 2015 03:49:50 GMT: INFO (info): (thr_info.c::4566) namespace disk-base: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 99 cache-read pct 0.00
Nov 05 2015 03:49:50 GMT: INFO (info): (thr_info.c::4551) namespace base: disk inuse: 18097664 memory inuse: 9236481 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
Nov 05 2015 03:49:50 GMT: INFO (info): (thr_info.c::4566) namespace mbidmapping: disk inuse: 264274383744 memory inuse: 58468326400 (bytes) sindex memory inuse: 0 (bytes) avail pct 40 cache-read pct 17.50
Nov 05 2015 03:49:50 GMT: INFO (info): (thr_info.c::4576)    partitions: actual 3222 sync 2876 desync 109 zombie 0 wait 0 absent 6081
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::137) histogram dump: reads (72246516007 total) msec
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::154)  (00: 72153987829) (01: 0069045017) (02: 0018430232) (03: 0001389542)
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::154)  (04: 0001581270) (05: 0001589118) (06: 0000479107) (07: 0000010853)
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::154)  (08: 0000001659) (09: 0000000863) (10: 0000000494) (11: 0000000023)
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (997426768 total) msec
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::154)  (00: 0986089877) (01: 0005452551) (02: 0003527890) (03: 0001803162)
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::154)  (04: 0000494057) (05: 0000044663) (06: 0000004595) (07: 0000003320)
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::154)  (08: 0000002151) (09: 0000001677) (10: 0000001505) (11: 0000001281)
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::163)  (12: 0000000039)
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::137) histogram dump: proxy (31776 total) msec
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::154)  (00: 0000027373) (01: 0000001692) (02: 0000000944) (03: 0000000764)
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::154)  (04: 0000000556) (05: 0000000221) (06: 0000000048) (07: 0000000036)
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::154)  (08: 0000000067) (09: 0000000051) (10: 0000000022) (11: 0000000002)
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (997417421 total) msec
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::154)  (00: 0986091121) (01: 0005448547) (02: 0003526079) (03: 0001802365)
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::154)  (04: 0000493676) (05: 0000044311) (06: 0000004024) (07: 0000002185)
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::154)  (08: 0000001776) (09: 0000001677) (10: 0000001305) (11: 0000000325)
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::163)  (12: 0000000030)
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Nov 05 2015 03:49:50 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Nov 05 2015 03:49:54 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb9dd890bbf926c and self bb91d8c0bbf926c
Nov 05 2015 03:49:54 GMT: INFO (paxos): (paxos.c::2272) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb9dd890bbf926c,bb99d7e0bbf926c,bb9498c0bbf926c,bb91d8c0bbf926c,bb9198c0bbf926c
Nov 05 2015 03:49:56 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /data/aerospike/data/disk-base.dat: used 0, contig-free 102142M (102142 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 311 (0.0/s), defrag-q 0 defrag-tot 0 (0.0/s)
Nov 05 2015 03:49:59 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /data/aerospike/data/base.dat: used 18102016, contig-free 16353M (16353 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 202920 (0.0/s), defrag-q 0 defrag-tot 202891 (0.0/s)
Nov 05 2015 03:49:59 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb9dd890bbf926c and self bb91d8c0bbf926c
Nov 05 2015 03:49:59 GMT: INFO (paxos): (paxos.c::2272) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb9dd890bbf926c,bb99d7e0bbf926c,bb9498c0bbf926c,bb91d8c0bbf926c,bb9198c0bbf926c
Nov 05 2015 03:50:00 GMT: INFO (info): (thr_info.c::4488)  system memory: free 61165676kb ( 46 percent free ) 
Nov 05 2015 03:50:00 GMT: INFO (info): (thr_info.c::4495)  migrates in progress ( 683 , 0 ) ::: ClusterSize 4 ::: objects 913652945
Nov 05 2015 03:50:00 GMT: INFO (info): (thr_info.c::4503)  rec refs 913819204 ::: rec locks 2 ::: trees 0 ::: wr reqs 0 ::: mig tx 683 ::: mig rx 0
Nov 05 2015 03:50:00 GMT: INFO (info): (thr_info.c::4509)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 441671766 :: master 0 
Nov 05 2015 03:50:00 GMT: INFO (info): (thr_info.c::4519)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (691, 4285220, 4284529) : hb (5, 22, 17) : fab (72, 272, 200)
Nov 05 2015 03:50:00 GMT: INFO (info): (thr_info.c::4521)    heartbeat_received: self 0 : foreign 1037539694
Nov 05 2015 03:50:00 GMT: INFO (info): (thr_info.c::4522)    heartbeat_stats: bt 0 bf 414116273 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 22 efd 17 efa 5 um 0 mcf 0 rc 17 
Nov 05 2015 03:50:00 GMT: INFO (info): (thr_info.c::4535)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 683 ssdr 0 ssdw 0 rw 2
Nov 05 2015 03:50:00 GMT: INFO (info): (thr_info.c::4566) namespace disk-base: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 99 cache-read pct 0.00
Nov 05 2015 03:50:00 GMT: INFO (info): (thr_info.c::4551) namespace base: disk inuse: 18102784 memory inuse: 9239601 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
Nov 05 2015 03:50:00 GMT: INFO (info): (thr_info.c::4566) namespace mbidmapping: disk inuse: 264274471936 memory inuse: 58468346240 (bytes) sindex memory inuse: 0 (bytes) avail pct 40 cache-read pct 17.89
Nov 05 2015 03:50:00 GMT: INFO (info): (thr_info.c::4576)    partitions: actual 3222 sync 2876 desync 109 zombie 0 wait 0 absent 6081
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::137) histogram dump: reads (72246598164 total) msec
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::154)  (00: 72154069935) (01: 0069045053) (02: 0018430246) (03: 0001389543)
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::154)  (04: 0001581270) (05: 0001589118) (06: 0000479107) (07: 0000010853)
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::154)  (08: 0000001659) (09: 0000000863) (10: 0000000494) (11: 0000000023)
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (997437653 total) msec
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::154)  (00: 0986099911) (01: 0005453196) (02: 0003528072) (03: 0001803186)
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::154)  (04: 0000494057) (05: 0000044663) (06: 0000004595) (07: 0000003320)
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::154)  (08: 0000002151) (09: 0000001677) (10: 0000001505) (11: 0000001281)
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::163)  (12: 0000000039)
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::137) histogram dump: proxy (31776 total) msec
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::154)  (00: 0000027373) (01: 0000001692) (02: 0000000944) (03: 0000000764)
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::154)  (04: 0000000556) (05: 0000000221) (06: 0000000048) (07: 0000000036)
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::154)  (08: 0000000067) (09: 0000000051) (10: 0000000022) (11: 0000000002)
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (997428306 total) msec
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::154)  (00: 0986101156) (01: 0005449191) (02: 0003526261) (03: 0001802389)
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::154)  (04: 0000493676) (05: 0000044311) (06: 0000004024) (07: 0000002185)
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::154)  (08: 0000001776) (09: 0000001677) (10: 0000001305) (11: 0000000325)
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::163)  (12: 0000000030)
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Nov 05 2015 03:50:00 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Nov 05 2015 03:50:02 GMT: INFO (hb): (hb.c::2297) HB node bb9dd890bbf926c in different cluster - succession lists don't match
Nov 05 2015 03:50:04 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb9dd890bbf926c and self bb91d8c0bbf926c
Nov 05 2015 03:50:04 GMT: INFO (paxos): (paxos.c::2272) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb9dd890bbf926c,bb99d7e0bbf926c,bb9498c0bbf926c,bb91d8c0bbf926c,bb9198c0bbf926c

And the log from 192.168.3.10:

Nov 05 2015 04:02:47 GMT: INFO (hb): (hb.c::2297) HB node bb9198c0bbf926c in different cluster - succession lists don't match
Nov 05 2015 04:02:47 GMT: INFO (hb): (hb.c::2297) HB node bb99d7e0bbf926c in different cluster - succession lists don't match
Nov 05 2015 04:02:47 GMT: INFO (hb): (hb.c::2297) HB node bb91d8c0bbf926c in different cluster - succession lists don't match
Nov 05 2015 04:02:49 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb9498c0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:02:49 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb91d8c0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:02:49 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb9198c0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:02:49 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb99d7e0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:02:49 GMT: INFO (paxos): (paxos.c::2272) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb99d7e0bbf926c,bb9498c0bbf926c,bb91d8c0bbf926c,bb9198c0bbf926c
Nov 05 2015 04:02:52 GMT: INFO (hb): (hb.c::2297) HB node bb9498c0bbf926c in different cluster - succession lists don't match
Nov 05 2015 04:02:53 GMT: INFO (info): (thr_info.c::4488)  system memory: free 71782384kb ( 54 percent free ) 
Nov 05 2015 04:02:53 GMT: INFO (info): (thr_info.c::4495)  migrates in progress ( 0 , 1 ) ::: ClusterSize 5 ::: objects 749801899
Nov 05 2015 04:02:53 GMT: INFO (info): (thr_info.c::4503)  rec refs 749801899 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 1
Nov 05 2015 04:02:53 GMT: INFO (info): (thr_info.c::4509)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 472183903 :: master 0 
Nov 05 2015 04:02:53 GMT: INFO (info): (thr_info.c::4519)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (0, 41367207, 41367207) : hb (3, 18, 15) : fab (72, 208, 136)
Nov 05 2015 04:02:53 GMT: INFO (info): (thr_info.c::4521)    heartbeat_received: self 0 : foreign 1023681614
Nov 05 2015 04:02:53 GMT: INFO (info): (thr_info.c::4522)    heartbeat_stats: bt 0 bf 320541690 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 3 sf3 0 sf4 0 sf5 0 sf6 2 mrf 0 eh 63 efd 12 efa 51 um 0 mcf 0 rc 10 
Nov 05 2015 04:02:53 GMT: INFO (info): (thr_info.c::4535)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 1 migtx 0 ssdr 0 ssdw 0 rw 0
Nov 05 2015 04:02:53 GMT: INFO (info): (thr_info.c::4566) namespace disk-base: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 99 cache-read pct 0.00
Nov 05 2015 04:02:53 GMT: INFO (info): (thr_info.c::4551) namespace base: disk inuse: 27797760 memory inuse: 15544077 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
Nov 05 2015 04:02:53 GMT: INFO (info): (thr_info.c::4566) namespace mbidmapping: disk inuse: 216845714304 memory inuse: 47976231360 (bytes) sindex memory inuse: 0 (bytes) avail pct 48 cache-read pct 0.00
Nov 05 2015 04:02:53 GMT: INFO (info): (thr_info.c::4576)    partitions: actual 2538 sync 2295 desync 0 zombie 7455 wait 0 absent 0
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::137) histogram dump: reads (85096746112 total) msec
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::154)  (00: 85026403415) (01: 0046710997) (02: 0018140716) (03: 0001477031)
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::154)  (04: 0001830783) (05: 0001757951) (06: 0000409000) (07: 0000014256)
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::154)  (08: 0000001403) (09: 0000000225) (10: 0000000092) (11: 0000000126)
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::163)  (12: 0000000114) (13: 0000000003)
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (18013644516 total) msec
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::154)  (00: 17157494480) (01: 0548251493) (02: 0222064595) (03: 0069959624)
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::154)  (04: 0014182163) (05: 0001565284) (06: 0000087183) (07: 0000009793)
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::154)  (08: 0000019181) (09: 0000000874) (10: 0000005699) (11: 0000004010)
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::163)  (12: 0000000130) (13: 0000000007)
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::137) histogram dump: proxy (145128 total) msec
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::154)  (00: 0000144853) (01: 0000000136) (02: 0000000065) (03: 0000000017)
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::154)  (04: 0000000022) (05: 0000000022) (06: 0000000003) (07: 0000000004)
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::163)  (08: 0000000004) (10: 0000000002)
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (18013632589 total) msec
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::154)  (00: 17158113112) (01: 0547783033) (02: 0221940967) (03: 0069936055)
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::154)  (04: 0014178880) (05: 0001564786) (06: 0000086621) (07: 0000008666)
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::154)  (08: 0000018859) (09: 0000000874) (10: 0000000633) (11: 0000000067)
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::163)  (12: 0000000032) (13: 0000000004)
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Nov 05 2015 04:02:53 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Nov 05 2015 04:02:54 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb9498c0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:02:54 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb91d8c0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:02:54 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb9198c0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:02:54 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb99d7e0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:02:54 GMT: INFO (paxos): (paxos.c::2272) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb99d7e0bbf926c,bb9498c0bbf926c,bb91d8c0bbf926c,bb9198c0bbf926c
Nov 05 2015 04:02:58 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /data/aerospike/data/disk-base.dat: used 0, contig-free 102367M (102367 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 31 (0.0/s), defrag-q 0 defrag-tot 100 (0.0/s)
Nov 05 2015 04:02:59 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb9498c0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:02:59 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb91d8c0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:02:59 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb9198c0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:02:59 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb99d7e0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:02:59 GMT: INFO (paxos): (paxos.c::2272) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb99d7e0bbf926c,bb9498c0bbf926c,bb91d8c0bbf926c,bb9198c0bbf926c
Nov 05 2015 04:03:02 GMT: INFO (hb): (hb.c::2297) HB node bb9198c0bbf926c in different cluster - succession lists don't match
Nov 05 2015 04:03:02 GMT: INFO (hb): (hb.c::2297) HB node bb99d7e0bbf926c in different cluster - succession lists don't match
Nov 05 2015 04:03:02 GMT: INFO (hb): (hb.c::2297) HB node bb91d8c0bbf926c in different cluster - succession lists don't match
Nov 05 2015 04:03:03 GMT: INFO (info): (thr_info.c::4488)  system memory: free 71782512kb ( 54 percent free ) 
Nov 05 2015 04:03:03 GMT: INFO (info): (thr_info.c::4495)  migrates in progress ( 0 , 1 ) ::: ClusterSize 5 ::: objects 749801899
Nov 05 2015 04:03:03 GMT: INFO (info): (thr_info.c::4503)  rec refs 749801899 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 1
Nov 05 2015 04:03:03 GMT: INFO (info): (thr_info.c::4509)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 472183903 :: master 0 
Nov 05 2015 04:03:03 GMT: INFO (info): (thr_info.c::4519)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (0, 41367221, 41367221) : hb (3, 18, 15) : fab (72, 208, 136)
Nov 05 2015 04:03:03 GMT: INFO (info): (thr_info.c::4521)    heartbeat_received: self 0 : foreign 1023681878
Nov 05 2015 04:03:03 GMT: INFO (info): (thr_info.c::4522)    heartbeat_stats: bt 0 bf 320541690 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 3 sf3 0 sf4 0 sf5 0 sf6 2 mrf 0 eh 63 efd 12 efa 51 um 0 mcf 0 rc 10 
Nov 05 2015 04:03:03 GMT: INFO (info): (thr_info.c::4535)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 1 migtx 0 ssdr 0 ssdw 0 rw 0
Nov 05 2015 04:03:03 GMT: INFO (info): (thr_info.c::4566) namespace disk-base: disk inuse: 0 memory inuse: 0 (bytes) sindex memory inuse: 0 (bytes) avail pct 99 cache-read pct 0.00
Nov 05 2015 04:03:03 GMT: INFO (info): (thr_info.c::4551) namespace base: disk inuse: 27797760 memory inuse: 15544077 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
Nov 05 2015 04:03:03 GMT: INFO (info): (thr_info.c::4566) namespace mbidmapping: disk inuse: 216845714304 memory inuse: 47976231360 (bytes) sindex memory inuse: 0 (bytes) avail pct 48 cache-read pct 0.00
Nov 05 2015 04:03:03 GMT: INFO (info): (thr_info.c::4576)    partitions: actual 2538 sync 2295 desync 0 zombie 7455 wait 0 absent 0
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::137) histogram dump: reads (85096746112 total) msec
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::154)  (00: 85026403415) (01: 0046710997) (02: 0018140716) (03: 0001477031)
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::154)  (04: 0001830783) (05: 0001757951) (06: 0000409000) (07: 0000014256)
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::154)  (08: 0000001403) (09: 0000000225) (10: 0000000092) (11: 0000000126)
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::163)  (12: 0000000114) (13: 0000000003)
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (18013644516 total) msec
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::154)  (00: 17157494480) (01: 0548251493) (02: 0222064595) (03: 0069959624)
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::154)  (04: 0014182163) (05: 0001565284) (06: 0000087183) (07: 0000009793)
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::154)  (08: 0000019181) (09: 0000000874) (10: 0000005699) (11: 0000004010)
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::163)  (12: 0000000130) (13: 0000000007)
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::137) histogram dump: proxy (145128 total) msec
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::154)  (00: 0000144853) (01: 0000000136) (02: 0000000065) (03: 0000000017)
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::154)  (04: 0000000022) (05: 0000000022) (06: 0000000003) (07: 0000000004)
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::163)  (08: 0000000004) (10: 0000000002)
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (18013632589 total) msec
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::154)  (00: 17158113112) (01: 0547783033) (02: 0221940967) (03: 0069936055)
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::154)  (04: 0014178880) (05: 0001564786) (06: 0000086621) (07: 0000008666)
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::154)  (08: 0000018859) (09: 0000000874) (10: 0000000633) (11: 0000000067)
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::163)  (12: 0000000032) (13: 0000000004)
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Nov 05 2015 04:03:03 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Nov 05 2015 04:03:04 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb9498c0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:03:04 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb91d8c0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:03:04 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb9198c0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:03:04 GMT: INFO (paxos): (paxos.c::2227) Cluster Integrity Check: Detected succession list discrepancy between node bb99d7e0bbf926c and self bb9dd890bbf926c
Nov 05 2015 04:03:04 GMT: INFO (paxos): (paxos.c::2272) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb99d7e0bbf926c,bb9498c0bbf926c,bb91d8c0bbf926c,bb9198c0bbf926c
Nov 05 2015 04:03:04 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /data/aerospike/data/base.dat: used 27797760, contig-free 16338M (16338 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 248276 (0.0/s), defrag-q 0 defrag-tot 248297 (0.0/s)
Nov 05 2015 04:03:04 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /dev/sdb1: used 108425960960, contig-free 176320M (1410561 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 46546324 (0.0/s), defrag-q 0 defrag-tot 45058725 (0.0/s)
Nov 05 2015 04:03:06 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /dev/sdc1: used 108419753344, contig-free 244520M (1956166 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 1096060 (0.0/s), defrag-q 0 defrag-tot 161843 (0.0/s)
Nov 05 2015 04:03:07 GMT: INFO (hb): (hb.c::2297) HB node bb9498c0bbf926c in different cluster - succession lists don't match

Config of 192.168.3.10:

service {
	user root
	group root
	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
	pidfile /var/run/aerospike/asd.pid
	service-threads 20
	transaction-queues 20
	transaction-threads-per-queue 3
	proto-fd-max 15000
}

logging {
	# Log file must be an absolute path.
	file /data/logs/aerospike/aerospike.log {
		context any info
	}
}

network {
	service {
		address any
		port 3000
	}

	heartbeat {

		# To use unicast-mesh heartbeats, comment out the 3 lines above and
		# use the following 4 lines instead.
		mode mesh
		port 3002
		mesh-address 192.168.3.11 
		mesh-port 3002

		interval 150
		timeout 10
	}

	fabric {
		port 3001
	}

	info {
		port 3003
	}
}

namespace disk-base {
        replication-factor 2
        memory-size 10G
        default-ttl 30d # 30 days, use 0 to never expire/evict.

#        storage-engine memory

        # To use file storage backing, comment out the line above and use the
        # following lines instead.
       storage-engine device {
               file /data/aerospike/data/disk-base.dat
               filesize 100G
               data-in-memory false # Store data in memory in addition to file.
       }
}

namespace base {
        replication-factor 2
        memory-size 10G
        default-ttl 30d # 30 days, use 0 to never expire/evict.

#        storage-engine memory

        # To use file storage backing, comment out the line above and use the
        # following lines instead.
       storage-engine device {
               file /data/aerospike/data/base.dat
               filesize 16G
               data-in-memory true # Store data in memory in addition to file.
       }
}


namespace mbidmapping {
	replication-factor 2
	memory-size 100G 
	default-ttl 365d  

	storage-engine device {
		device /dev/sdb1
		device /dev/sdc1

		scheduler-mode noop
		write-block-size 128K

		data-in-memory false
	}
}

#2

Cluster integrity faults occur when the node cannot see other nodes over the network. I note here you are using mesh and have a single node specified. You could improve resilience by adding the other nodes in your cluster to the heartbeat stanza.
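
As a rough sketch of what that could look like on 192.168.3.10: this assumes your build accepts repeated `mesh-seed-address-port` lines in place of the single `mesh-address` / `mesh-port` pair (I believe 3.x releases do, but please check the configuration reference for your version). The peer addresses are the other four nodes as they appear in this thread, and the port, interval and timeout are copied from your posted config.

```
heartbeat {
	mode mesh
	port 3002

	# One seed line per peer, so losing a single seed
	# does not cut this node off from the whole cluster.
	# Assumption: option name as in the 3.x configuration reference.
	mesh-seed-address-port 192.168.3.8 3002
	mesh-seed-address-port 192.168.3.9 3002
	mesh-seed-address-port 192.168.3.11 3002
	mesh-seed-address-port 192.168.3.12 3002

	interval 150
	timeout 10
}
```

Each node would list the other members in the same way, leaving out its own address.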

As you’re probably aware, the migrations occurred due to the node leaving the cluster and the other nodes rebalancing.


#3

Are you running on a cloud environment by any chance?


#4
Monitor> info
===NODES===
2016-06-06 13:37:26.097532
Sorting by IP, in Ascending order:
ip:port               Build   Cluster      Cluster   Free   Free   Migrates              Node         Principal      Replicated    Sys
                          .      Size   Visibility   Disk    Mem          .                ID                ID         Objects   Free
                          .         .            .    pct    pct          .                 .                 .               .    Mem
192.168.3.10:3000    3.7.4         2        false     69     52      (0,0)   BB9DD890BBF926C   BB9DD890BBF926C   1,018,855,449     36
192.168.3.11:3000    3.7.4         3        false     69     51      (1,0)   BB91D8C0BBF926C   BB9498C0BBF926C   1,039,056,034     35
192.168.3.12:3000    3.7.4         5         true     69     52      (0,0)   BB99D7E0BBF926C   BB99D7E0BBF926C   1,019,887,173     35
192.168.3.8:3000     3.7.4         3        false     69     52      (1,1)   BB9198C0BBF926C   BB9498C0BBF926C   1,023,852,371     46
192.168.3.9:3000     3.7.4         3        false     68     51      (1,2)   BB9498C0BBF926C   BB9498C0BBF926C   1,040,798,766     46
Number of nodes displayed: 5
Monitor> latency
	====writes_master====
                                   timespan   ops/sec   >1ms   >8ms   >64ms
192.168.3.10:3000    05:23:56-GMT->05:24:06       0.0   0.00   0.00    0.00
192.168.3.11:3000    05:23:54-GMT->05:24:04     870.8   3.54   0.73    0.46
192.168.3.12:3000    05:23:57-GMT->05:24:07       0.0   0.00   0.00    0.00
192.168.3.8:3000     05:23:58-GMT->05:24:08     859.2   2.20   0.01    0.00
192.168.3.9:3000     05:23:56-GMT->05:24:06     514.3   1.81   0.16    0.08


	====reads====
                                   timespan   ops/sec   >1ms   >8ms   >64ms
192.168.3.10:3000    05:23:56-GMT->05:24:06       2.1   0.00   0.00    0.00
192.168.3.11:3000    05:23:54-GMT->05:24:04   43206.6   0.46   0.11    0.07
192.168.3.12:3000    05:23:57-GMT->05:24:07       0.0   0.00   0.00    0.00
192.168.3.8:3000     05:23:58-GMT->05:24:08   39239.5   0.45   0.12    0.00
192.168.3.9:3000     05:23:56-GMT->05:24:06   34063.8   0.54   0.13    0.09

No, we are not in a cloud environment. This happened again recently: after a network problem the cluster broke apart, and we had to restart nodes 192.168.3.12:3000 and 192.168.3.10:3000 to fix it. Is there any way to avoid the restart?


#5

My recommendation would be to upgrade to the latest version. There have been improvements for nodes/clusters to recover after network perturbation.

  • As of release 3.7.0.1 a new paxos-recovery-policy can be configured: auto-reset-master, which should help.

  • As of release 3.8.1 this policy has been made default.

I would recommend upgrading to the latest 3.8.3 release.
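
For reference, a minimal sketch of how the policy might be set in the configuration file on a release between 3.7.0.1 and 3.8.1, where it is not yet the default. I am assuming it belongs in the `service` stanza alongside the other paxos settings; the remaining lines are simply taken from the config posted above, so please verify against the configuration reference for your build.

```
service {
	user root
	group root
	paxos-single-replica-limit 1
	# Assumption: set in the service context; this is the default as of 3.8.1.
	paxos-recovery-policy auto-reset-master
	pidfile /var/run/aerospike/asd.pid
}
```

The setting would need to be the same on every node in the cluster.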


#6

You can also try running the commands specified in the log file when this happens… for example:

Nov 05 2015 04:03:04 GMT: INFO (paxos): (paxos.c::2272) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb99d7e0bbf926c,bb9498c0bbf926c,bb91d8c0bbf926c,bb9198c0bbf926c

The log will then give you the command for Phase 2 of 2 (to undun). But this doesn’t always work. Again, my recommendation would be to upgrade to the latest release.