Migrations are stuck for over 1 week


#1

Continuing the discussion from Migrations stuck on a node:

Here’s the output of ‘show statistics like migrate’ on my cluster:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                         :   aero-0     aero-1     aero-3     aero-4     aero-c-7    aero-c-m   
migrate_msgs_recv            :   74558464   49248664   39201115   15823290   495537078   79078954   
migrate_msgs_sent            :   72242170   46970918   38585233   15219436   485633485   76755175   
migrate_num_incoming_accepted:   11560      792        403        145        157010      11968      
migrate_num_incoming_refused :   0          0          0          0          9314265     0          
migrate_progress_recv        :   0          0          0          0          0           0          
migrate_progress_send        :   4          4          4          4          0           4          
migrate_rx_objs              :   0          0          0          0          0           0          
migrate_tx_objs              :   3393       3384       3407       3400       0           3375   

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~graph Namespace Statistics~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                           :   aero-0   aero-1   aero-3   aero-4   aero-c-7   aero-c-m   
migrate-rx-partitions-initial  :   3472     3537     3392     3427     4363       3553       
migrate-rx-partitions-remaining:   3330     3402     3257     3282     4363       3420       
migrate-tx-partitions-imbalance:   29       0        0        0        58305      413        
migrate-tx-partitions-initial  :   4072     4064     4074     4074     1395       4065       
migrate-tx-partitions-remaining:   4069     4062     4062     4055     742        4064    

The numbers in the graph namespace have remained the same for a very long time, and I notice that on aero-c-7mxp (abbreviated as aero-c-7 by asadm) there is a large imbalance as well as many refused migration messages. Is the migration stuck? What can I do to confirm this? I have checked the migration settings and they are all set to defaults. The linked thread’s suggestion of bumping the migrate-threads didn’t seem to make a difference.


#2

Each time migrate-tx-partitions-imbalance increments there is a warning about why it was incremented sent to the logs. These are potentially bad messages a few of which have been determined to be ok. Would really like to see what warnings you have coming from migration and partition.

  1. Could you provide the output of:
    cat /var/log/aerospike.log | grep WARNING | cut -d ' ' -f8 | sort | uniq -c
  2. Which version of Aerospike is this from?

To recover, perform a rolling restart. When stopping a node ensure that the rest of the nodes report the new cluster size without that node and that migrations begin before starting the down node again. There have been a few migrations improvements in the latest releases so if you aren’t already on the latest you could use this opportunity to upgrade to the latest.


#3

Hi kporter, thanks for the reply. I did figure this out. aero-c-7mxp was created using v3.7.1 while aero-0 through aero-4 were still on version 3.5.15. Somehow, this scenario made it so that 7mxp was refusing migrates across the entire cluster. My solution was to boot aero-c-7mxp from the cluster. At that point, migrates started moving again. You can see in the output in my original message that ‘migrate_num_incoming_refused’ was very high for that machine. Seeing that, I decided to take it out of the equation as no other machine was refusing migrates.

Unfortunately, I booted it from the cluster by deleting it from my cloud provider, so I can’t provide logs. However holistically i had turned migrate logging up to debug level, and saw that everyone was trying to send migrates to 7mxp, and 7mxp was rejecting them (“partition not ready” perhaps?)

With a good migration, the number of remaining partitions for both tx and rx steadily decreases over time. I hope that it will be done by the end of the day.

Thanks for the reply!


#4

Are you sure the other nodes were on 3.5.15? The migrate-tx-partitions-imbalance stat wasn’t introduced until 3.7.0 and 3 of your nodes have a positive value for it, and I believe if the node didn’t report the value asadm will report “N/A”. Any of the nodes with a positive value for migrate-tx-partitions-imbalance would have also logged a warning as to why this increased and that information would be very useful in understanding what has happened here.

Cannot find anything similar to this.


#5

Hi kporter, they are all on 3.7.1 now, but when 7mxp was created, all others were on 3.5.15. That has since been fixed because we really needed the auto-reset-master paxos recovery policy.

Here’s a migrate debug log from aero-0 failing to send to 7mxp:

Jan 20 2016 14:25:51 GMT: DEBUG (migrate): (migrate.c:migrate_send_finish:1125) migrate xmit failed, mig 171001, cluster key mismatch: {graph:1070}
Jan 20 2016 14:25:51 GMT: DEBUG (migrate): (migrate.c:as_migrate_print2_cluster_key:407) MIGRATE_XMIT_FAIL: cluster key global 5117ba069fbc1f8c recd 47dc58015ab25501
Jan 20 2016 14:25:51 GMT: DEBUG (migrate): (migrate.c:migrate_migrate_destroy:755) {graph:1070}migrate_migrate_destroy: END MIG 0x7f1a98c32408
Jan 20 2016 14:25:51 GMT: DEBUG (migrate): (migrate.c:migration_pop:2086) got smallest migrate, tree size = 619, q sz = 2752

The ‘recd’ and ‘global’ keys were always the same, the cluster key was always different, but this message appeared countless times. Once i removed the offending node, but kept migrate logging on, aero-0 was able to successfully send migrates out.

I hope that helps at least a little bit.


#6

Further questions:

  1. Have you increased the number of migrate-threads to 4? The stats you provided indicate this to be the case, just want to confirm.
  2. Have you decreased migrate-max-num-incoming from 256, and if so what is the new value? This should also be fine but recently we discovered an pre-existing race condition where the counter that limits the max num incoming can become stuck above 0.
    I feel fairly certain this is the problem, exceeding migrate-max-num-incoming is the only condition that increments the migrate-num-incoming-refused stat. This has been fixed in top of tree and should be in the next release tracked by AER-3296. In the meantime, you can workaround this issue by increasing migrate-max-num-incoming.

#7
  1. Yes
  2. No, it is still set to 256; I had never changed anything but the config values described here I could not tell you what the value was on 7mxp before i nuked it.

Good to know it is fixed in top of tree. Thanks for the discussion and glad to hear it is a real bug and not just us misusing the product.


#8

Just spoke with QA, 3.7.2 is expected to be released by the end of the week as long as there are no surprises.

Hm, that is surprising, I do know that one of the race conditions fixed in 3.7.0 expanded this race window making it easier to discover. If you are still seeing the migrate-num-incoming-refused increase then I would increase migrate-max-num-incoming as a workaround until 3.7.2 is released.


#9

Hi kporter, after booting out the 7mxp node, we have yet to see ‘refused’ again. Thanks for the workaround though.


#10

Hi kporter, if you’re still there, there appears to be one more issue. The migration is, I would say, 99% complete. However, it appears to be stuck on some last partitions. Looks like this:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Configuration~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                    :   aero-0   aero-1   aero-3   aero-4   aero-b-9   aero-b-j   aero-b-z   aero-c-g   aero-c-k   aero-c-m   
migrate-max-num-incoming:   512      512      512      512      512        512        512        512        512        512        
migrate-read-priority   :   10       10       10       10       10         10         10         10         10         10         
migrate-read-sleep      :   100      100      100      100      500        500        500        500        500        100        
migrate-rx-lifetime-ms  :   60000    60000    60000    60000    60000      60000      60000      60000      60000      60000      
migrate-threads         :   4        4        4        4        4          4          4          1          4          4          
migrate-xmit-hwm        :   100      100      100      100      100        100        100        100        100        100        
migrate-xmit-lwm        :   5        5        5        5        5          5          5          5          5          5          
migrate-xmit-priority   :   40       40       40       40       40         40         40         40         40         40         
migrate-xmit-sleep      :   100      100      100      100      100        100        100        100        100        100        

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~graph Namespace Configuration~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                           :   aero-0   aero-1   aero-3   aero-4   aero-b-9   aero-b-j   aero-b-z   aero-c-g   aero-c-k   aero-c-m   
migrate-rx-partitions-initial  :   671      707      763      768      1895       1851       1840       1738       769        1054       
migrate-rx-partitions-remaining:   24       1        11       11       1          8          6          36         0          0          
migrate-tx-partitions-imbalance:   29       0        0        0        0          0          0          0          0          413        
migrate-tx-partitions-initial  :   2008     1938     1807     1801     369        352        359        368        1468       1586       
migrate-tx-partitions-remaining:   4        13       10       5        0          1          1          47         10         7

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                         :   aero-0      aero-1      aero-3      aero-4      aero-b-9    aero-b-j    aero-b-z    aero-c-g    aero-c-k    aero-c-m    
migrate_msgs_recv            :   573316085   546728692   564498519   547012091   422959253   405414968   403435543   502599116   433458329   858367837   
migrate_msgs_sent            :   570768816   544273037   563712562   546201355   422953325   405353754   403376218   502526034   433242495   855850739   
migrate_num_incoming_accepted:   14197       3584        3148        2919        2235        2098        2160        826         2638        14760       
migrate_num_incoming_refused :   0           0           0           0           0           0           0           0           0           0           
migrate_progress_recv        :   0           0           0           2           0           1           1           2           0           0           
migrate_progress_send        :   0           1           1           2           0           0           0           1           0           1           
migrate_rx_objs              :   0           0           0           2           0           1           1           2           0           0           
migrate_tx_objs              :   0           1           1           2           0           0           0           48          0           1  

The number of recv/sent migrate messages is still moving, so I assume something is going on. But the partitions remaining is not moving at all. Could it be that some partitions are much larger than others?


#11

Raised migrate logging to DEBUG, got something interesting:

aero-1, aero-3, aero-4, aero-c-mole

Jan 21 2016 16:48:08 GMT: DEBUG (migrate): (migrate.c:migrate_retransmit_reduce_fn:1080) migrate_retransmit: mig 226162 sending: now 1041929438 xmit_ms 1041928436
Jan 21 2016 16:48:08 GMT: DEBUG (migrate): (migrate.c:migrate_retransmit_reduce_fn:1087) migrate_retransmit: mig 226162 sending: dstnode bb915d4f00a0142 msg 0x7fedaadbee08
(over and over again)

aero-b-zmh1 and aero-b-j12q

=== aero-b-zmh1 ===
Jan 21 2016 16:48:06 GMT: WARNING (rw): (thr_rw.c::3276) {graph} ttl 3958976118 exceeds 315360000 - set config value max-ttl to suppress this warning <Digest>:0xddf59303b610a7c940277595f4649e8bf3cfa447
Jan 21 2016 16:48:06 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x7d08c9738210eaea98c8d432a233fd8987fe9e61
Jan 21 2016 16:48:06 GMT: WARNING (rw): (thr_rw.c::3276) {graph} ttl 3956198396 exceeds 315360000 - set config value max-ttl to suppress this warning <Digest>:0x1bf430b76819b15d9402bdf626b697f2da983eb5
Jan 21 2016 16:48:07 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x7d08c9738210eaea98c8d432a233fd8987fe9e61
Jan 21 2016 16:48:08 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x7d08c9738210eaea98c8d432a233fd8987fe9e61
Jan 21 2016 16:48:09 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x7d08c9738210eaea98c8d432a233fd8987fe9e61
Jan 21 2016 16:48:10 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x7d08c9738210eaea98c8d432a233fd8987fe9e61
Jan 21 2016 16:48:11 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x7d08c9738210eaea98c8d432a233fd8987fe9e61
Jan 21 2016 16:48:12 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x7d08c9738210eaea98c8d432a233fd8987fe9e61
Jan 21 2016 16:48:13 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x7d08c9738210eaea98c8d432a233fd8987fe9e61
=== aero-b-j12q ===
Jan 21 2016 16:48:05 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x6e52ec99231227c3b361ac34ac20a779e44d0e6d
Jan 21 2016 16:48:06 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x6e52ec99231227c3b361ac34ac20a779e44d0e6d
Jan 21 2016 16:48:07 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x6e52ec99231227c3b361ac34ac20a779e44d0e6d
Jan 21 2016 16:48:08 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x6e52ec99231227c3b361ac34ac20a779e44d0e6d
Jan 21 2016 16:48:09 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x6e52ec99231227c3b361ac34ac20a779e44d0e6d
Jan 21 2016 16:48:10 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x6e52ec99231227c3b361ac34ac20a779e44d0e6d
Jan 21 2016 16:48:11 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x6e52ec99231227c3b361ac34ac20a779e44d0e6d
Jan 21 2016 16:48:12 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x6e52ec99231227c3b361ac34ac20a779e44d0e6d
Jan 21 2016 16:48:13 GMT: WARNING (migrate): (migrate.c::1228) migration received binless pickle <Digest>:0x6e52ec99231227c3b361ac34ac20a779e44d0e6d

Looks like a standstill at this point.


#12

This check was introduced in 3.7.1 and wasn’t properly handled and becomes stuck in a retransmit loop. This has also been corrected in 3.7.2. Basically there are records in which you have removed all of the bins, normally this would delete the record but we recently discovered ways in which this record would persist without bins.


#13

Aerospike 3.7.2 was just released which includes the patch for binless migrations.