Hi,
We’re running into an issue where the cluster stops seeing some of its nodes (possibly due to AWS network flakiness) and is never able to recover from that on its own. This has happened 3 times over the last 4 days, so it’s becoming a serious problem. I tried dunning/undunning the nodes, but it didn’t help. The only thing that seems to help is a full cluster restart, which (with a cold start) is a big issue because it costs us 1hr+ of downtime. Is there anything we can do to get the cluster to re-sync without downtime? Here’s the cluster state while it’s in this condition:
ip:port   Build   Cluster Size   Cluster Visibility   Free Disk pct   Free Mem pct   Migrates   Node ID   Principal ID   Replicated Objects   Sys Free Mem
ip-10-111-138-147.ec2.internal:3000 3.5.15 13 false 71 41 (289,2) BB9938A6F0A0022 BB9EE49930B0022 530,432,759 34
ip-10-111-139-173.ec2.internal:3000 3.5.15 13 false 74 45 (1,1) BB9AD8B6F0A0022 BB9EE49930B0022 495,814,943 45
ip-10-111-139-74.ec2.internal:3000 3.5.15 13 false 70 40 (352,2) BB94A8B6F0A0022 BB9EE49930B0022 542,284,897 35
ip-10-123-137-38.ec2.internal:3000 3.5.15 13 false 72 43 (0,1) BB926897B0A0022 BB9EE49930B0022 507,573,115 45
ip-10-13-162-113.ec2.internal:3000 3.5.15 13 false 74 47 (0,0) BB90F88440B0022 BB9EE49930B0022 471,453,671 49
ip-10-144-96-162.ec2.internal:3000 3.5.15 13 false 74 47 (31,2) BB960CB7F0B0022 BB9EE49930B0022 477,242,998 41
ip-10-150-109-85.ec2.internal:3000 3.5.15 13 false 75 47 (1,0) BB9EE49930B0022 BB9EE49930B0022 477,220,750 46
ip-10-155-175-218.ec2.internal:3000 3.5.15 13 false 72 43 (154,4) BB98A03900B0022 BB9EE49930B0022 508,699,003 38
ip-10-158-130-242.ec2.internal:3000 3.5.15 13 false 72 42 (180,1) BB98F849A0B0022 BB9EE49930B0022 524,100,857 30
ip-10-159-124-56.ec2.internal:3000 3.5.15 13 false 75 46 (19,1) BB9B9849A0B0022 BB9EE49930B0022 487,249,531 39
ip-10-16-162-206.ec2.internal:3000 3.5.15 13 false 73 43 (238,1) BB9CEA2100A0022 BB9EE49930B0022 511,080,500 36
ip-10-16-162-254.ec2.internal:3000 3.5.15 20 true 78 55 (182,0) BB9FEA2100A0022 BB9FEA2100A0022 402,243,111 45
ip-10-167-65-135.ec2.internal:3000 3.5.15 19 false 79 55 (111,2) BB9F305830B0022 BB9F685900B0022 404,907,124 39
ip-10-186-157-14.ec2.internal:3000 3.5.15 13 false 72 44 (268,12) BB98703900B0022 BB9EE49930B0022 504,577,391 34
ip-10-233-107-4.ec2.internal:3000 3.5.15 19 false 74 49 (0,0) BB9B3057A0B0022 BB9F685900B0022 459,590,912 28
ip-10-45-54-41.ec2.internal:3000 3.5.15 15 false 79 59 (0,0) BB9C8429A0B0022 BB9F685900B0022 363,002,662 54
ip-10-61-177-58.ec2.internal:3000 3.5.15 19 false 76 51 (154,6) BB9E8429A0B0022 BB9F685900B0022 435,457,457 30
ip-10-69-174-74.ec2.internal:3000 3.5.15 14 false 70 40 (0,11) BB9F685900B0022 BB9F685900B0022 537,234,525 38
ip-10-69-76-209.ec2.internal:3000 3.5.15 13 false 73 43 (85,2) BB9E305830B0022 BB9EE49930B0022 510,572,508 38
ip-10-93-128-96.ec2.internal:3000 3.5.15 19 false 75 51 (123,5) BB960805D0A0022 BB9F685900B0022 438,592,928 40
Then after 12 hours I ran asadm -e "cluster dun all; shell sleep 15; cluster undun all", and this is what the cluster looked like afterwards:
ip:port   Build   Cluster Size   Cluster Visibility   Free Disk pct   Free Mem pct   Migrates   Node ID   Principal ID   Replicated Objects   Sys Free Mem
ip-10-111-138-147.ec2.internal:3000 3.5.15 1 false 71 41 (0,0) BB9938A6F0A0022 BB9FEA2100A0022 530,207,185 41
ip-10-111-139-173.ec2.internal:3000 3.5.15 1 false 74 45 (0,0) BB9AD8B6F0A0022 BB9FEA2100A0022 495,679,435 47
ip-10-111-139-74.ec2.internal:3000 3.5.15 7 false 70 40 (0,0) BB94A8B6F0A0022 BB9FEA2100A0022 542,186,166 42
ip-10-123-137-38.ec2.internal:3000 3.5.15 1 false 72 43 (0,0) BB926897B0A0022 BB9FEA2100A0022 507,434,748 45
ip-10-13-162-113.ec2.internal:3000 3.5.15 7 false 74 47 (0,0) BB90F88440B0022 BB9FEA2100A0022 471,306,685 49
ip-10-144-96-162.ec2.internal:3000 3.5.15 7 false 74 47 (0,0) BB960CB7F0B0022 BB9FEA2100A0022 476,957,500 48
ip-10-150-109-85.ec2.internal:3000 3.5.15 1 false 75 47 (0,0) BB9EE49930B0022 BB9FEA2100A0022 477,107,343 48
ip-10-155-175-218.ec2.internal:3000 3.5.15 7 false 72 43 (0,0) BB98A03900B0022 BB9FEA2100A0022 508,284,308 44
ip-10-158-130-242.ec2.internal:3000 3.5.15 1 false 72 42 (0,0) BB98F849A0B0022 BB9FEA2100A0022 523,958,189 35
ip-10-159-124-56.ec2.internal:3000 3.5.15 1 false 75 46 (0,0) BB9B9849A0B0022 BB9FEA2100A0022 487,088,978 46
ip-10-16-162-206.ec2.internal:3000 3.5.15 1 false 73 43 (0,0) BB9CEA2100A0022 BB9FEA2100A0022 510,881,539 44
ip-10-16-162-254.ec2.internal:3000 3.5.15 1 false 78 55 (0,0) BB9FEA2100A0022 BB9FEA2100A0022 401,793,692 50
ip-10-167-65-135.ec2.internal:3000 3.5.15 1 false 79 55 (0,0) BB9F305830B0022 BB9FEA2100A0022 404,427,309 46
ip-10-186-157-14.ec2.internal:3000 3.5.15 7 false 72 44 (0,0) BB98703900B0022 BB9FEA2100A0022 502,509,745 42
ip-10-233-107-4.ec2.internal:3000 3.5.15 1 false 74 49 (0,0) BB9B3057A0B0022 BB9FEA2100A0022 459,440,994 29
ip-10-45-54-41.ec2.internal:3000 3.5.15 1 false 79 59 (0,0) BB9C8429A0B0022 BB9FEA2100A0022 362,549,790 54
ip-10-61-177-58.ec2.internal:3000 3.5.15 1 false 76 51 (0,0) BB9E8429A0B0022 BB9FEA2100A0022 434,947,297 38
ip-10-69-174-74.ec2.internal:3000 3.5.15 1 false 70 40 (0,0) BB9F685900B0022 BB9FEA2100A0022 536,799,518 37
ip-10-69-76-209.ec2.internal:3000 3.5.15 1 false 73 43 (0,0) BB9E305830B0022 BB9FEA2100A0022 510,371,666 45
ip-10-93-128-96.ec2.internal:3000 3.5.15 7 false 75 51 (0,0) BB960805D0A0022 BB9F305830B0022 438,522,767 47
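As a sanity check, a spot-check like the one below (a minimal sketch, assuming asinfo is installed on each host; cluster_size and cluster_integrity are the statistics I mean) shows each node’s own view of the cluster:

asinfo -h ip-10-111-138-147.ec2.internal -v "statistics" | tr ';' '\n' | grep -E 'cluster_size|cluster_integrity'
# repeated against each of the other hosts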
We’ve already raised the heartbeat settings to interval 150 and timeout 40. Any thoughts on what’s going on and why the cluster is unable to get back in sync?
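For reference, the relevant heartbeat stanza in aerospike.conf now looks roughly like this (mesh seed addresses omitted; the mode and port lines are shown only as typical mesh values, while the interval and timeout are what we actually changed):

network {
    heartbeat {
        mode mesh          # typical for AWS, where multicast isn't available
        port 3002
        # mesh-seed-address-port entries omitted
        interval 150
        timeout 40
    }
}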
Here’s a log snippet from one of the servers:
Nov 09 2015 15:13:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb960805d0a0022 and self bb98f849a0b0022
Nov 09 2015 15:13:26 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes: dun:nodes=bb9f305830b0022,bb9ee49930b0022,bb9e8429a0b0022,bb9e305830b0022,bb9cea2100a0022,bb9c8429a0b0022,bb9b9849a0b0022,bb9b3057a0b0022,bb9ad8b6f0a0022,bb9938a6f0a0022,bb98f849a0b0022,bb98a03900b0022,bb98703900b0022,bb960cb7f0b0022,bb960805d0a0022,bb94a8b6f0a0022,bb926897b0a0022,bb90f88440b0022
Nov 09 2015 15:13:26 GMT: INFO (paxos): (paxos.c::2519) as_paxos_retransmit_check: node bb98f849a0b0022 retransmitting partition sync request to principal bb9fea2100a0022 ...
Nov 09 2015 15:13:26 GMT: INFO (paxos): (paxos.c::2229) Sent partition sync request to node bb9fea2100a0022
Nov 09 2015 15:13:30 GMT: INFO (hb): (hb.c::2319) HB node bb960805d0a0022 in different cluster - succession lists don't match
Nov 09 2015 15:13:31 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb960805d0a0022 and self bb98f849a0b0022
Nov 09 2015 15:13:31 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes: dun:nodes=bb9f305830b0022,bb9ee49930b0022,bb9e8429a0b0022,bb9e305830b0022,bb9cea2100a0022,bb9c8429a0b0022,bb9b9849a0b0022,bb9b3057a0b0022,bb9ad8b6f0a0022,bb9938a6f0a0022,bb98f849a0b0022,bb98a03900b0022,bb98703900b0022,bb960cb7f0b0022,bb960805d0a0022,bb94a8b6f0a0022,bb926897b0a0022,bb90f88440b0022
Nov 09 2015 15:13:31 GMT: INFO (paxos): (paxos.c::2519) as_paxos_retransmit_check: node bb98f849a0b0022 retransmitting partition sync request to principal bb9fea2100a0022 ...
Nov 09 2015 15:13:31 GMT: INFO (paxos): (paxos.c::2229) Sent partition sync request to node bb9fea2100a0022
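In case it’s relevant, this is how I’d apply the dun the log suggests, node by node with the standalone asinfo tool (the dun:nodes value is copied verbatim from the CLUSTER INTEGRITY FAULT line above), before undunning again via asadm as before:

asinfo -h ip-10-111-138-147.ec2.internal -v "dun:nodes=bb9f305830b0022,bb9ee49930b0022,bb9e8429a0b0022,bb9e305830b0022,bb9cea2100a0022,bb9c8429a0b0022,bb9b9849a0b0022,bb9b3057a0b0022,bb9ad8b6f0a0022,bb9938a6f0a0022,bb98f849a0b0022,bb98a03900b0022,bb98703900b0022,bb960cb7f0b0022,bb960805d0a0022,bb94a8b6f0a0022,bb926897b0a0022,bb90f88440b0022"
# repeated against each of the other nodes

Is that the right way to apply the suggested fix, or is there a better procedure that will let the cluster re-form without a full restart?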