Cluster integrity fault


#1

I am running a 7-node cluster on version 3.7.1 and ran into this issue: CLUSTER INTEGRITY FAULT

One of the nodes went down unexpectedly, with the errors below in its log. How do I fix it?

Jan 23 2016 07:37:57 GMT: INFO (paxos): (paxos.c::2626) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142
Jan 23 2016 07:37:57 GMT: INFO (paxos): (paxos.c::2747) as_paxos_retransmit_check: node bb90500f00a0142 retransmitting partition sync request to principal bb91800f00a0142 ... 
Jan 23 2016 07:37:57 GMT: INFO (paxos): (paxos.c::2331) Sent partition sync request to node bb91800f00a0142
Jan 23 2016 07:37:57 GMT: INFO (hb): (hb.c::2502) HB node bb90e00f00a0142 in different cluster - succession lists don't match
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::2567) Cluster Integrity Check: Detected succession list discrepancy between node bb91800f00a0142 and self bb90500f00a0142
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::273) Paxos List [bb91800f00a0142,bb91700f00a0142,bb90d00f00a0142,bb90600f00a0142,bb90500f00a0142]
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::273) Node List [bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142]
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::2567) Cluster Integrity Check: Detected succession list discrepancy between node bb90e00f00a0142 and self bb90500f00a0142
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::273) Paxos List [bb91800f00a0142,bb91700f00a0142,bb90d00f00a0142,bb90600f00a0142,bb90500f00a0142]
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::273) Node List [bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142]
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::2567) Cluster Integrity Check: Detected succession list discrepancy between node bb91b00f00a0142 and self bb90500f00a0142
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::273) Paxos List [bb91800f00a0142,bb91700f00a0142,bb90d00f00a0142,bb90600f00a0142,bb90500f00a0142]
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::273) Node List [bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142]
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::2567) Cluster Integrity Check: Detected succession list discrepancy between node bb91700f00a0142 and self bb90500f00a0142
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::273) Paxos List [bb91800f00a0142,bb91700f00a0142,bb90d00f00a0142,bb90600f00a0142,bb90500f00a0142]
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::273) Node List [bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142]
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::2567) Cluster Integrity Check: Detected succession list discrepancy between node bb90d00f00a0142 and self bb90500f00a0142
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::273) Paxos List [bb91800f00a0142,bb91700f00a0142,bb90d00f00a0142,bb90600f00a0142,bb90500f00a0142]
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::273) Node List [bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142]
Jan 23 2016 07:38:02 GMT: INFO (hb): (hb.c::3015) Marking node add for paxos recovery: bb90e00f00a0142
Jan 23 2016 07:38:02 GMT: INFO (hb): (hb.c::3015) Marking node add for paxos recovery: bb91b00f00a0142
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::2626) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::2747) as_paxos_retransmit_check: node bb90500f00a0142 retransmitting partition sync request to principal bb91800f00a0142 ... 
Jan 23 2016 07:38:02 GMT: INFO (paxos): (paxos.c::2331) Sent partition sync request to node bb91800f00a0142
Jan 23 2016 07:38:02 GMT: INFO (info): (thr_info.c::5084)  system memory: free 5352548kb ( 69 percent free ) 
Jan 23 2016 07:38:02 GMT: INFO (info): (thr_info.c::5090)  ClusterSize 6 ::: objects 6912570 ::: sub_objects 0
Jan 23 2016 07:38:02 GMT: INFO (info): (thr_info.c::5099)  rec refs 6912577 ::: rec locks 7 ::: trees 0 ::: wr reqs 4 ::: mig tx 0 ::: mig rx 0
Jan 23 2016 07:38:02 GMT: INFO (info): (thr_info.c::5104)  replica errs :: null 0 non-null 0 ::: sync copy errs :: master 0 
Jan 23 2016 07:38:02 GMT: INFO (info): (thr_info.c::5114)    trans_in_progress: wr 2 prox 0 wait 0 ::: q 0 ::: iq 0 ::: dq 0 : fds - proto (65, 18541, 18476) : hb (9, 48, 39) : fab (74, 259, 185)
Jan 23 2016 07:38:02 GMT: INFO (info): (thr_info.c::5116)    heartbeat_received: self 0 : foreign 46669589
Jan 23 2016 07:38:02 GMT: INFO (info): (thr_info.c::5117)    heartbeat_stats: bt 0 bf 10559044 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 7 efd 7 efa 0 um 0 mcf 21 rc 39 
Jan 23 2016 07:38:02 GMT: INFO (info): (thr_info.c::5129)    tree_counts: nsup 0 scan 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 11
Jan 23 2016 07:38:02 GMT: INFO (info): (thr_info.c::5158) {test} disk bytes used 45658315776 : avail pct 80 : cache-read pct 0.00
Jan 23 2016 07:38:02 GMT: INFO (info): (thr_info.c::5160) {test} memory bytes used 442404480 (index 442404480 : sindex 0) : used pct 8.24
Jan 23 2016 07:38:02 GMT: INFO (info): (thr_info.c::5171) {test} ldt_gc: cnt 0 io 0 gc 0 (0, 0, 0)
Jan 23 2016 07:38:02 GMT: INFO (info): (thr_info.c::5194) {test} migrations - remaining (200 tx, 306 rx), active (0 tx, 0 rx), 85.92% complete
Jan 23 2016 07:38:02 GMT: INFO (info): (thr_info.c::5203)    partitions: actual 691 sync 1376 desync 0 zombie 0 absent 2029
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::137) histogram dump: reads (5952059623 total) msec
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::154)  (00: 1076176185) (01: 1216175296) (02: 2138006062) (03: 1335862245)
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::154)  (04: 0143882662) (05: 0025648576) (06: 0009097292) (07: 0002262401)
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::154)  (08: 0001502285) (09: 0001599065) (10: 0001483058) (11: 0000363880)
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::163)  (12: 0000000616)
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (5120364 total) msec
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::154)  (00: 0000261874) (01: 0000760077) (02: 0000963408) (03: 0000984072)
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::154)  (04: 0001053068) (05: 0000693503) (06: 0000328111) (07: 0000064856)
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::154)  (08: 0000009837) (09: 0000001387) (10: 0000000162) (11: 0000000009)
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::137) histogram dump: proxy (2087 total) msec
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::154)  (00: 0000000012) (01: 0000000007) (02: 0000000007) (03: 0000000045)
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::154)  (04: 0000000482) (05: 0000000523) (06: 0000000466) (07: 0000000279)
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::163)  (08: 0000000181) (09: 0000000085)
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Jan 23 2016 07:38:02 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Jan 23 2016 07:38:04 GMT: INFO (hb): (hb.c::2502) HB node bb91b00f00a0142 in different cluster - succession lists don't match
Jan 23 2016 07:38:04 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /data/aerospike.dat: used 45658315776, contig-free 247607M (247607 wblocks), swb-free 16, w-q 0 w-tot 1440379 (0.0/s), defrag-q 0 defrag-tot 1503519 (0.0/s) defrag-w-tot 684864 (0.0/s)
Jan 23 2016 07:38:06 GMT: INFO (hb): (hb.c::2502) HB node bb90d00f00a0142 in different cluster - succession lists don't match
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::2567) Cluster Integrity Check: Detected succession list discrepancy between node bb91800f00a0142 and self bb90500f00a0142
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::273) Paxos List [bb91800f00a0142,bb91700f00a0142,bb90d00f00a0142,bb90600f00a0142,bb90500f00a0142]
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::273) Node List [bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142]
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::2567) Cluster Integrity Check: Detected succession list discrepancy between node bb90e00f00a0142 and self bb90500f00a0142
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::273) Paxos List [bb91800f00a0142,bb91700f00a0142,bb90d00f00a0142,bb90600f00a0142,bb90500f00a0142]
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::273) Node List [bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142]
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::2567) Cluster Integrity Check: Detected succession list discrepancy between node bb91b00f00a0142 and self bb90500f00a0142
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::273) Paxos List [bb91800f00a0142,bb91700f00a0142,bb90d00f00a0142,bb90600f00a0142,bb90500f00a0142]
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::273) Node List [bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142]
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::2567) Cluster Integrity Check: Detected succession list discrepancy between node bb91700f00a0142 and self bb90500f00a0142
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::273) Paxos List [bb91800f00a0142,bb91700f00a0142,bb90d00f00a0142,bb90600f00a0142,bb90500f00a0142]
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::273) Node List [bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142]
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::2567) Cluster Integrity Check: Detected succession list discrepancy between node bb90d00f00a0142 and self bb90500f00a0142
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::273) Paxos List [bb91800f00a0142,bb91700f00a0142,bb90d00f00a0142,bb90600f00a0142,bb90500f00a0142]
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::273) Node List [bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142]
Jan 23 2016 07:38:07 GMT: INFO (hb): (hb.c::3015) Marking node add for paxos recovery: bb90e00f00a0142
Jan 23 2016 07:38:07 GMT: INFO (hb): (hb.c::3015) Marking node add for paxos recovery: bb91b00f00a0142
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::2626) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::2747) as_paxos_retransmit_check: node bb90500f00a0142 retransmitting partition sync request to principal bb91800f00a0142 ... 
Jan 23 2016 07:38:07 GMT: INFO (paxos): (paxos.c::2331) Sent partition sync request to node bb91800f00a0142
Jan 23 2016 07:38:09 GMT: INFO (hb): (hb.c::2502) HB node bb91700f00a0142 in different cluster - succession lists don't match
Jan 23 2016 07:38:12 GMT: INFO (hb): (hb.c::2502) HB node bb91800f00a0142 in different cluster - succession lists don't match
Jan 23 2016 07:38:12 GMT: INFO (paxos): (paxos.c::2567) Cluster Integrity Check: Detected succession list discrepancy between node bb91800f00a0142 and self bb90500f00a0142
Jan 23 2016 07:38:12 GMT: INFO (paxos): (paxos.c::273) Paxos List [bb91800f00a0142,bb91700f00a0142,bb90d00f00a0142,bb90600f00a0142,bb90500f00a0142]
Jan 23 2016 07:38:12 GMT: INFO (paxos): (paxos.c::273) Node List [bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142]
Jan 23 2016 07:38:12 GMT: INFO (paxos): (paxos.c::2567) Cluster Integrity Check: Detected succession list discrepancy between node bb90e00f00a0142 and self bb90500f00a0142
Jan 23 2016 07:38:12 GMT: INFO (paxos): (paxos.c::273) Paxos List [bb91800f00a0142,bb91700f00a0142,bb90d00f00a0142,bb90600f00a0142,bb90500f00a0142]
Jan 23 2016 07:38:12 GMT: INFO (paxos): (paxos.c::273) Node List [bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142]
Jan 23 2016 07:38:12 GMT: INFO (paxos): (paxos.c::2567) Cluster Integrity Check: Detected succession list discrepancy between node bb91b00f00a0142 and self bb90500f00a0142
Jan 23 2016 07:38:12 GMT: INFO (paxos): (paxos.c::273) Paxos List [bb91800f00a0142,bb91700f00a0142,bb90d00f00a0142,bb90600f00a0142,bb90500f00a0142]
Jan 23 2016 07:38:12 GMT: INFO (paxos): (paxos.c::273) Node List [bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142]
Jan 23 2016 07:38:12 GMT: INFO (paxos): (paxos.c::2567) Cluster Integrity Check: Detected succession list discrepancy between node bb91700f00a0142 and self bb90500f00a0142
Jan 23 2016 07:38:12 GMT: INFO (paxos): (paxos.c::273) Paxos List [bb91800f00a0142,bb91700f00a0142,bb90d00f00a0142,bb90600f00a0142,bb90500f00a0142]
Jan 23 2016 07:38:12 GMT: INFO (paxos): (paxos.c::273) Node List [bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142]
Jan 23 2016 07:38:12 GMT: INFO (paxos): (paxos.c::2567) Cluster Integrity Check: Detected succession list discrepancy between node bb90d00f00a0142 and self bb90500f00a0142
Jan 23 2016 07:38:12 GMT: INFO (paxos): (paxos.c::273) Paxos List [bb91800f00a0142,bb91700f00a0142,bb90d00f00a0142,bb90600f00a0142,bb90500f00a0142]
Jan 23 2016 07:38:12 GMT: INFO (paxos): (paxos.c::273) Node List [bb91b00f00a0142,bb91800f00a0142,bb91700f00a0142,bb90e00f00a0142,bb90d00f00a0142]
Jan 23 2016 07:38:12 GMT: INFO (hb): (hb.c::3015) Marking node add for paxos recovery: bb90e00f00a0142
Jan 23 2016 07:38:12 GMT: INFO (hb): (hb.c::3015) Marking node add for paxos recovery: bb91b00f00a0142

#2

Try setting paxos-recovery-policy to auto-reset-master. This policy was introduced in 3.7.0, and my tests no longer get into these situations since I enabled it.
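
For reference, a minimal sketch of where the setting would go, assuming it belongs in the service context of aerospike.conf (worth confirming against the configuration reference for your exact version). The comment line stands in for whatever service settings you already have:

```
service {
    # ... your existing service settings stay as they are ...

    # Assumed placement: service context of aerospike.conf
    paxos-recovery-policy auto-reset-master
}
```

A setting in the static config file is picked up when the node restarts, so it would need to be added on every node. If the parameter is dynamic in your build, a command along the lines of asinfo -v "set-config:context=service;paxos-recovery-policy=auto-reset-master" run on each node may apply it without a restart, but treat that as an assumption to verify against the docs.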