The numbers in the graph namespace have remained the same for a very long time, and I notice that on aero-c-7mxp (abbreviated as aero-c-7 by asadm) there is a large imbalance as well as many refused-migration messages. Is the migration stuck? What can I do to confirm this? I have checked the migration settings and they are all at their defaults. The linked thread’s suggestion of bumping migrate-threads didn’t seem to make a difference.
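In case it matters, this is roughly how I checked the settings on each node (assuming asinfo from the tools package is installed; config names may differ slightly by version): asinfo -v 'get-config' | tr ';' '\n' | grep migrate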
Each time migrate-tx-partitions-imbalance increments, a warning explaining why it was incremented is written to the logs. These are potentially bad messages, though a few of them have been determined to be benign. I would really like to see what warnings you have coming from the migrate and partition contexts.
Could you provide the output of: cat /var/log/aerospike.log | grep WARNING | cut -d ' ' -f8 | sort | uniq -c
Which version of Aerospike is this output from?
To recover, perform a rolling restart. When stopping a node, ensure that the rest of the nodes report the new cluster size without that node and that migrations begin before starting the downed node again. There have been several migration improvements in the latest releases, so if you aren’t already on the latest version you could use this opportunity to upgrade.
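A rough per-node sketch of that procedure, assuming the stock init script and that asadm is available (command names may vary with your install):
sudo service aerospike stop
asadm -e 'info network'     # run against a surviving node; confirm the cluster size dropped by one
asadm -e 'info namespace'   # confirm migrations have started before proceeding
sudo service aerospike start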
Hi kporter, thanks for the reply. I did figure this out. aero-c-7mxp was created using v3.7.1 while aero-0 through aero-4 were still on version 3.5.15. Somehow, this scenario caused 7mxp to refuse migrates across the entire cluster. My solution was to boot aero-c-7mxp from the cluster. At that point, migrates started moving again. You can see in the output in my original message that ‘migrate_num_incoming_refused’ was very high for that machine. Seeing that, I decided to take it out of the equation, since no other machine was refusing migrates.
Unfortunately, I booted it from the cluster by deleting it from my cloud provider, so I can’t provide logs. However, before removing it I had turned migrate logging up to debug level, and saw that every node was trying to send migrates to 7mxp, and 7mxp was rejecting them (“partition not ready”, perhaps?).
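If anyone else needs to do the same, I believe the migrate context can be raised to debug at runtime with something along these lines (sink id 0 is typically the log file; double-check against the docs for your version): asinfo -v 'log-set:id=0;migrate=debug'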
With a healthy migration, the number of remaining partitions for both tx and rx steadily decreases over time. I hope that it will be done by the end of the day.
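One way to watch the remaining partition counts for a namespace (the exact stat names vary a bit between versions, so this is just a sketch) is something like: asinfo -v 'namespace/graph' | tr ';' '\n' | grep migrate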
Are you sure the other nodes were on 3.5.15? The migrate-tx-partitions-imbalance stat wasn’t introduced until 3.7.0, yet 3 of your nodes have a positive value for it, and I believe asadm would report “N/A” if a node didn’t report the value. Any of the nodes with a positive value for migrate-tx-partitions-imbalance would also have logged a warning as to why it increased, and that information would be very useful in understanding what has happened here.
Hi kporter, they are all on 3.7.1 now, but when 7mxp was created, all others were on 3.5.15. That has since been fixed because we really needed the auto-reset-master paxos recovery policy.
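For reference, we enabled that with a line like the following in the service stanza of aerospike.conf (I’m recalling the option name from memory, so double-check the docs for your version): paxos-recovery-policy auto-reset-master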
Here’s a migrate debug log from aero-0 failing to send to 7mxp:
Jan 20 2016 14:25:51 GMT: DEBUG (migrate): (migrate.c:migrate_send_finish:1125) migrate xmit failed, mig 171001, cluster key mismatch: {graph:1070}
Jan 20 2016 14:25:51 GMT: DEBUG (migrate): (migrate.c:as_migrate_print2_cluster_key:407) MIGRATE_XMIT_FAIL: cluster key global 5117ba069fbc1f8c recd 47dc58015ab25501
Jan 20 2016 14:25:51 GMT: DEBUG (migrate): (migrate.c:migrate_migrate_destroy:755) {graph:1070}migrate_migrate_destroy: END MIG 0x7f1a98c32408
Jan 20 2016 14:25:51 GMT: DEBUG (migrate): (migrate.c:migration_pop:2086) got smallest migrate, tree size = 619, q sz = 2752
The ‘recd’ and ‘global’ values never changed, and they never matched each other; this message appeared countless times. Once I removed the offending node, but kept migrate logging on, aero-0 was able to successfully send migrates out.
Have you increased the number of migrate-threads to 4? The stats you provided indicate this to be the case; I just want to confirm.
Have you decreased migrate-max-num-incoming from its default of 256, and if so, what is the new value? Decreasing it should also be fine, but we recently discovered a pre-existing race condition where the counter that limits the max number of incoming migrations can become stuck above 0.
I feel fairly certain this is the problem: exceeding migrate-max-num-incoming is the only condition that increments the migrate-num-incoming-refused stat. This has been fixed at the top of tree and should be in the next release, tracked by AER-3296. In the meantime, you can work around this issue by increasing migrate-max-num-incoming.
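If I remember the dynamic config syntax correctly, the workaround can be applied at runtime on each node with something like the following (512 is just an example value; verify the parameter is dynamic in your version first): asinfo -v 'set-config:context=service;migrate-max-num-incoming=512'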
No, it is still set to 256; I have never changed anything but the config values described here. I couldn’t tell you what the value was on 7mxp before I nuked it.
Good to know it is fixed at the top of tree. Thanks for the discussion, and glad to hear it is a real bug and not just us misusing the product.
Just spoke with QA; 3.7.2 is expected to be released by the end of the week as long as there are no surprises.
Hm, that is surprising. I do know that one of the fixes in 3.7.0 expanded this race window, making it easier to hit. If you are still seeing migrate-num-incoming-refused increase, then I would increase migrate-max-num-incoming as a workaround until 3.7.2 is released.
Hi kporter, if you’re still there, there appears to be one more issue. The migration is, I would say, 99% complete. However, it appears to be stuck on the last few partitions. It looks like this:
The number of received/sent migrate messages is still moving, so I assume something is going on, but the number of partitions remaining is not moving at all. Could it be that some partitions are much larger than others?
This check was introduced in 3.7.1, but its failure case wasn’t properly handled, so the migration becomes stuck in a retransmit loop. This has also been corrected in 3.7.2. Basically, there are records from which you have removed all of the bins; normally this would delete the record, but we recently discovered ways in which such a record can persist without any bins.