Cluster not syncing back: try rolling restart or fast restart (AER-4500)

Hi,

We’re running into an issue where the cluster stops seeing some of its nodes (possibly due to AWS network flakiness) and is never able to recover on its own. This has happened 3 times over the last 4 days, so it’s becoming a big problem. I tried dun/undun but it didn’t help. The only thing that seems to work is a full cluster restart, which (with cold restarts) costs us 1hr+ of downtime. Is there anything we can do to get the cluster to re-sync without downtime?

ip:port                                 Build   Cluster      Cluster   Free   Free   Migrates              Node         Principal    Replicated    Sys
                                            .      Size   Visibility   Disk    Mem          .                ID                ID       Objects   Free
                                            .         .            .    pct    pct          .                 .                 .             .    Mem
ip-10-111-138-147.ec2.internal:3000    3.5.15        13        false     71     41    (289,2)   BB9938A6F0A0022   BB9EE49930B0022   530,432,759     34
ip-10-111-139-173.ec2.internal:3000    3.5.15        13        false     74     45      (1,1)   BB9AD8B6F0A0022   BB9EE49930B0022   495,814,943     45
ip-10-111-139-74.ec2.internal:3000     3.5.15        13        false     70     40    (352,2)   BB94A8B6F0A0022   BB9EE49930B0022   542,284,897     35              
ip-10-123-137-38.ec2.internal:3000     3.5.15        13        false     72     43      (0,1)   BB926897B0A0022   BB9EE49930B0022   507,573,115     45              
ip-10-13-162-113.ec2.internal:3000     3.5.15        13        false     74     47      (0,0)   BB90F88440B0022   BB9EE49930B0022   471,453,671     49              
ip-10-144-96-162.ec2.internal:3000     3.5.15        13        false     74     47     (31,2)   BB960CB7F0B0022   BB9EE49930B0022   477,242,998     41              
ip-10-150-109-85.ec2.internal:3000     3.5.15        13        false     75     47      (1,0)   BB9EE49930B0022   BB9EE49930B0022   477,220,750     46              
ip-10-155-175-218.ec2.internal:3000    3.5.15        13        false     72     43    (154,4)   BB98A03900B0022   BB9EE49930B0022   508,699,003     38              
ip-10-158-130-242.ec2.internal:3000    3.5.15        13        false     72     42    (180,1)   BB98F849A0B0022   BB9EE49930B0022   524,100,857     30              
ip-10-159-124-56.ec2.internal:3000     3.5.15        13        false     75     46     (19,1)   BB9B9849A0B0022   BB9EE49930B0022   487,249,531     39              
ip-10-16-162-206.ec2.internal:3000     3.5.15        13        false     73     43    (238,1)   BB9CEA2100A0022   BB9EE49930B0022   511,080,500     36              
ip-10-16-162-254.ec2.internal:3000     3.5.15        20         true     78     55    (182,0)   BB9FEA2100A0022   BB9FEA2100A0022   402,243,111     45              
ip-10-167-65-135.ec2.internal:3000     3.5.15        19        false     79     55    (111,2)   BB9F305830B0022   BB9F685900B0022   404,907,124     39              
ip-10-186-157-14.ec2.internal:3000     3.5.15        13        false     72     44   (268,12)   BB98703900B0022   BB9EE49930B0022   504,577,391     34              
ip-10-233-107-4.ec2.internal:3000      3.5.15        19        false     74     49      (0,0)   BB9B3057A0B0022   BB9F685900B0022   459,590,912     28              
ip-10-45-54-41.ec2.internal:3000       3.5.15        15        false     79     59      (0,0)   BB9C8429A0B0022   BB9F685900B0022   363,002,662     54              
ip-10-61-177-58.ec2.internal:3000      3.5.15        19        false     76     51    (154,6)   BB9E8429A0B0022   BB9F685900B0022   435,457,457     30              
ip-10-69-174-74.ec2.internal:3000      3.5.15        14        false     70     40     (0,11)   BB9F685900B0022   BB9F685900B0022   537,234,525     38              
ip-10-69-76-209.ec2.internal:3000      3.5.15        13        false     73     43     (85,2)   BB9E305830B0022   BB9EE49930B0022   510,572,508     38              
ip-10-93-128-96.ec2.internal:3000      3.5.15        19        false     75     51    (123,5)   BB960805D0A0022   BB9F685900B0022   438,592,928     40              

Then after 12 hours I tried running this: asadm -e "cluster dun all; shell sleep 15; cluster undun all"

ip:port                                 Build   Cluster      Cluster   Free   Free   Migrates              Node         Principal    Replicated    Sys
                                            .      Size   Visibility   Disk    Mem          .                ID                ID       Objects   Free
                                            .         .            .    pct    pct          .                 .                 .             .    Mem
ip-10-111-138-147.ec2.internal:3000    3.5.15         1        false     71     41      (0,0)   BB9938A6F0A0022   BB9FEA2100A0022   530,207,185     41
ip-10-111-139-173.ec2.internal:3000    3.5.15         1        false     74     45      (0,0)   BB9AD8B6F0A0022   BB9FEA2100A0022   495,679,435     47
ip-10-111-139-74.ec2.internal:3000     3.5.15         7        false     70     40      (0,0)   BB94A8B6F0A0022   BB9FEA2100A0022   542,186,166     42
ip-10-123-137-38.ec2.internal:3000     3.5.15         1        false     72     43      (0,0)   BB926897B0A0022   BB9FEA2100A0022   507,434,748     45
ip-10-13-162-113.ec2.internal:3000     3.5.15         7        false     74     47      (0,0)   BB90F88440B0022   BB9FEA2100A0022   471,306,685     49
ip-10-144-96-162.ec2.internal:3000     3.5.15         7        false     74     47      (0,0)   BB960CB7F0B0022   BB9FEA2100A0022   476,957,500     48
ip-10-150-109-85.ec2.internal:3000     3.5.15         1        false     75     47      (0,0)   BB9EE49930B0022   BB9FEA2100A0022   477,107,343     48
ip-10-155-175-218.ec2.internal:3000    3.5.15         7        false     72     43      (0,0)   BB98A03900B0022   BB9FEA2100A0022   508,284,308     44
ip-10-158-130-242.ec2.internal:3000    3.5.15         1        false     72     42      (0,0)   BB98F849A0B0022   BB9FEA2100A0022   523,958,189     35
ip-10-159-124-56.ec2.internal:3000     3.5.15         1        false     75     46      (0,0)   BB9B9849A0B0022   BB9FEA2100A0022   487,088,978     46
ip-10-16-162-206.ec2.internal:3000     3.5.15         1        false     73     43      (0,0)   BB9CEA2100A0022   BB9FEA2100A0022   510,881,539     44
ip-10-16-162-254.ec2.internal:3000     3.5.15         1        false     78     55      (0,0)   BB9FEA2100A0022   BB9FEA2100A0022   401,793,692     50
ip-10-167-65-135.ec2.internal:3000     3.5.15         1        false     79     55      (0,0)   BB9F305830B0022   BB9FEA2100A0022   404,427,309     46
ip-10-186-157-14.ec2.internal:3000     3.5.15         7        false     72     44      (0,0)   BB98703900B0022   BB9FEA2100A0022   502,509,745     42
ip-10-233-107-4.ec2.internal:3000      3.5.15         1        false     74     49      (0,0)   BB9B3057A0B0022   BB9FEA2100A0022   459,440,994     29
ip-10-45-54-41.ec2.internal:3000       3.5.15         1        false     79     59      (0,0)   BB9C8429A0B0022   BB9FEA2100A0022   362,549,790     54
ip-10-61-177-58.ec2.internal:3000      3.5.15         1        false     76     51      (0,0)   BB9E8429A0B0022   BB9FEA2100A0022   434,947,297     38
ip-10-69-174-74.ec2.internal:3000      3.5.15         1        false     70     40      (0,0)   BB9F685900B0022   BB9FEA2100A0022   536,799,518     37
ip-10-69-76-209.ec2.internal:3000      3.5.15         1        false     73     43      (0,0)   BB9E305830B0022   BB9FEA2100A0022   510,371,666     45
ip-10-93-128-96.ec2.internal:3000      3.5.15         7        false     75     51      (0,0)   BB960805D0A0022   BB9F305830B0022   438,522,767     47

We’ve already increased the heartbeat settings to interval 150 and timeout 40. Any thoughts on what’s going on and why the cluster is unable to get back in sync?

Here’s a log snippet from one of the servers:

Nov 09 2015 15:13:26 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb960805d0a0022 and self bb98f849a0b0022
Nov 09 2015 15:13:26 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb9f305830b0022,bb9ee49930b0022,bb9e8429a0b0022,bb9e305830b0022,bb9cea2100a0022,bb9c8429a0b0022,bb9b9849a0b0022,bb9b3057a0b0022,bb9ad8b6f0a0022,bb9938a6f0a0022,bb98f849a0b0022,bb98a03900b0022,bb98703900b0022,bb960cb7f0b0022,bb960805d0a0022,bb94a8b6f0a0022,bb926897b0a0022,bb90f88440b0022
Nov 09 2015 15:13:26 GMT: INFO (paxos): (paxos.c::2519) as_paxos_retransmit_check: node bb98f849a0b0022 retransmitting partition sync request to principal bb9fea2100a0022 ... 
Nov 09 2015 15:13:26 GMT: INFO (paxos): (paxos.c::2229) Sent partition sync request to node bb9fea2100a0022
Nov 09 2015 15:13:30 GMT: INFO (hb): (hb.c::2319) HB node bb960805d0a0022 in different cluster - succession lists don't match
Nov 09 2015 15:13:31 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb960805d0a0022 and self bb98f849a0b0022
Nov 09 2015 15:13:31 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb9f305830b0022,bb9ee49930b0022,bb9e8429a0b0022,bb9e305830b0022,bb9cea2100a0022,bb9c8429a0b0022,bb9b9849a0b0022,bb9b3057a0b0022,bb9ad8b6f0a0022,bb9938a6f0a0022,bb98f849a0b0022,bb98a03900b0022,bb98703900b0022,bb960cb7f0b0022,bb960805d0a0022,bb94a8b6f0a0022,bb926897b0a0022,bb90f88440b0022
Nov 09 2015 15:13:31 GMT: INFO (paxos): (paxos.c::2519) as_paxos_retransmit_check: node bb98f849a0b0022 retransmitting partition sync request to principal bb9fea2100a0022 ... 
Nov 09 2015 15:13:31 GMT: INFO (paxos): (paxos.c::2229) Sent partition sync request to node bb9fea2100a0022

Given the situation, try a rolling restart of the cluster (a rough sketch of the loop follows these steps):

1. Take one node down and bring it back up.
2. Wait for asinfo -v STATUS to return OK.
3. Repeat on the next node.
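
For example - this is a minimal sketch only; the host list, SSH access, and the "aerospike" service name are assumptions about your environment:

HOSTS="ip-10-111-138-147 ip-10-111-139-173"   # ...add the remaining node hostnames
for host in $HOSTS; do
    ssh "$host" 'sudo service aerospike restart'
    # Wait until the restarted node reports OK before moving on to the next one.
    until ssh "$host" 'asinfo -v STATUS' | grep -q OK; do
        sleep 10
    done
done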

@kporter,

Thanks for getting back to me. A (cold) node restart by itself takes between 30 mins and 1 hr, so for 20 nodes that would take between 10 and 20 hours, and this has been happening to us every few days recently. Sounds like a serious limitation - are there any plans to have the cluster recover from situations like this automatically?

There is ongoing work to improve cluster stability. I’ve been running these changes in my own branch for a couple of weeks now with mostly positive results. The tests I’m doing cause many cluster disruptions, and with the new changes I can test the component I need to test without worrying much about the cluster failing to recover. I’m unsure when this is planned for release.

@naoum,

A solution available now to get the cluster to re-sync with minimal downtime is to upgrade to the Aerospike Enterprise Edition, which has a feature called fast restart (a.k.a. warm restart). This feature enables nodes to start up and re-join their cluster much faster. A node with over 1 billion records will restart in about 10 seconds (as opposed to 40+ minutes without this feature). This allows cluster upgrades and various other operations to go much faster.

In the Enterprise Edition, for namespaces configured with Flash storage or Data In Index, the node’s default behavior is to always try to fast restart.
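
If you want to gauge what fast restart buys you on your own nodes, one rough way is to time how long a node takes from restart until it answers STATUS with OK - a sketch only, assuming the stock "aerospike" init script and a locally reachable asinfo:

# Time from restart until the local node reports OK.
time ( sudo service aerospike restart && \
       until asinfo -v STATUS | grep -q OK; do sleep 1; done )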

@Mnemaudsyne,

I’m aware of fast restart and it seems like a great feature - you guys should really consider adding it to the community edition!

If the issue is all around the cluster ID and node list, is there any sort of admin command that can be run to override/refresh those at runtime?

Thanks!

Yes, but that command is dun/undun, which you have tried and which didn’t work for you. These normally do a fair job of pushing the cluster to recover. Sometimes I do need to run the sequence twice, though (have you tried running the sequence again?). If the second try doesn’t work (and I haven’t needed to do this since 'all' became an option for dun), I would then proceed to stop nodes claiming to be in the largest clusters.
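
For example, re-running the same asadm one-liner you used earlier with a pause in between (the pause length here is arbitrary, not prescriptive):

asadm -e "cluster dun all; shell sleep 15; cluster undun all"
sleep 60    # give the cluster time to settle before the second attempt
asadm -e "cluster dun all; shell sleep 15; cluster undun all"
asadm -e "info"    # then check whether cluster size and visibility recovered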

kporter, I’ve been experiencing this same issue on and off. So far I’ve only seen it in AWS - it hasn’t affected our bare-metal clusters, and it’s only been one of our AWS clusters (the others have been fine). Just curious whether you’ve seen the problem on AWS, bare metal, or some other cloud provider.

I’m mostly seeing the issues in my local test environment, which is multiple nodes running in Docker containers with cluster disruptions generated by tc or restarts. The ongoing work has eliminated these problems from my test environment.

The internal branch was prompted mainly by cloud environments. Though much less frequently, I have seen the same issues in bare-metal environments.

@kporter,

Could I use “tip” in cases like this? The docs for it are not very detailed.

It looks like these changes have been merged into the main branch, so they should be in the next release, tracked as JIRA ticket AER-4500. As for tip: the tip command is for cluster discovery, and I don’t think that is the problem here.
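
Since you mention the docs are sparse: tip just points a mesh node at another node’s heartbeat address so the two can discover each other. The syntax looks like this, where the host and heartbeat port are placeholders for your environment:

asinfo -v 'tip:host=10.111.138.147;port=3002'

It wouldn’t resolve a succession-list mismatch like the one in your logs, since those nodes are already exchanging heartbeats.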