Migrations stuck on a node


#1

Hi,

I have 2 nodes in my cluster that seem to be stuck at a certain number of outstanding migrates (for hours) and playing with the different migrate options doesn’t seem to affect that. The rest of the cluster doesn’t really have any outstanding migrates. Also overall the box is under 50% CPU utilized and it’s under the disk/mem watermarks. Any ideas?

Relevant settings:

asinfo -v get-config|tr ";" "\n"|grep migrate                              
migrate-threads=2
migrate-xmit-priority=100
migrate-xmit-sleep=50
migrate-read-priority=40
migrate-read-sleep=50
migrate-xmit-hwm=30
migrate-xmit-lwm=20
migrate-max-num-incoming=256
migrate-rx-lifetime-ms=60000

Some logs:

Nov 06 2015 17:36:20 GMT: INFO (info): (thr_info.c::4796)  system memory: free 27962348kb ( 44 percent free ) 
Nov 06 2015 17:36:20 GMT: INFO (info): (thr_info.c::4804)  migrates in progress ( 213 , 0 ) ::: ClusterSize 20 ::: objects 423825746 ::: sub_objects 0
Nov 06 2015 17:36:20 GMT: INFO (info): (thr_info.c::4812)  rec refs 423825747 ::: rec locks 1 ::: trees 0 ::: wr reqs 1 ::: mig tx 213 ::: mig rx 1
Nov 06 2015 17:36:20 GMT: INFO (info): (thr_info.c::4818)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0 
Nov 06 2015 17:36:20 GMT: INFO (info): (thr_info.c::4828)    trans_in_progress: wr 1 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (902, 8909043, 8908141) : hb (32, 4143, 4111) : fab (684, 1737, 1053)
Nov 06 2015 17:36:20 GMT: INFO (info): (thr_info.c::4830)    heartbeat_received: self 0 : foreign 16682056
Nov 06 2015 17:36:20 GMT: INFO (info): (thr_info.c::4831)    heartbeat_stats: bt 0 bf 12685829 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 9 mrf 0 eh 3519 efd 109 efa 3410 um 0 mcf 3906 rc 4102 
Nov 06 2015 17:36:20 GMT: INFO (info): (thr_info.c::4844)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 1 migtx 213 ssdr 0 ssdw 0 rw 2
Nov 06 2015 17:36:20 GMT: INFO (info): (thr_info.c::4885) namespace biddingdb: disk inuse: 363410280832 memory inuse: 27124847744 (bytes) sindex memory inuse: 0 (bytes) avail pct 67 cache-read pct 9.40
Nov 06 2015 17:36:20 GMT: INFO (info): (thr_info.c::4905)    partitions: actual 216 sync 190 desync 0 zombie 2 wait 0 absent 3688

asmonitor output one of the stuck nodes (there’s 19 other nodes):

ip:port                                 Build   Cluster      Cluster   Free   Free   Migrates              Node         Principal    Replicated    Sys
                                            .      Size   Visibility   Disk    Mem          .                ID                ID       Objects   Free
                                            .         .            .    pct    pct          .                 .                 .             .    Mem
ip-10-155-175-218.ec2.internal:3000    3.5.15        20         true     77     53    (213,0)   BB98A03900B0022   BB9FEA2100A0022   423,826,733     44


ip/namespace                                Avail     Evicted        Master     Repl     Stop       Used   Used      Used   Used    hwm   hwm
                                              Pct     Objects       Objects   Factor   Writes       Disk   Disk       Mem    Mem   Disk   Mem
                                                .           .             .        .        .          .      %         .      %      .     .
ip-10-155-175-218.ec2.internal/biddingdb       67           0   226,064,321        2    false   338.45 G     23   25.26 G     47     50    60

Thanks!


#2

Could you run the following at 2 minutes interval and post to check if any progress:

asadm -e 'asinfo -v statistics -l like migrate'

If a non-production cluster you may be able to bump up migration threads to a high number. Start by incrementing by 1 and keep an eye on iostat and memory usage.

asinfo -v "set-config:context=service;migrate-threads=5"

#3

@lucien,

I encountered this situation again. This time I had this:

asinfo -v get-config |tr ";" "\n"|grep migrate
migrate-threads=4
migrate-xmit-priority=0
migrate-xmit-sleep=100
migrate-read-priority=0
migrate-read-sleep=100
migrate-xmit-hwm=15
migrate-xmit-lwm=10 

I checked the stats on the node and it was moving, but barely:

sinfo -v statistics |tr ";" "\n"|grep migr
migrate_msgs_sent=215895491
migrate_msgs_recv=215896773
migrate_progress_send=349
migrate_progress_recv=1
migrate_num_incoming_accepted=1317
migrate_num_incoming_refused=0
migrate_tx_objs=349
migrate_rx_objs=1

migrate_msgs_sent=215896075
migrate_msgs_recv=215897357
migrate_progress_send=349
migrate_progress_recv=1
migrate_num_incoming_accepted=1317
migrate_num_incoming_refused=0
migrate_tx_objs=349
migrate_rx_objs=1

migrate_msgs_sent=215896621
migrate_msgs_recv=215897903
migrate_progress_send=349
migrate_progress_recv=1
migrate_num_incoming_accepted=1317
migrate_num_incoming_refused=0
migrate_tx_objs=349
migrate_rx_objs=1

The cpu and disk were also quite idle.

Then I changed the threads from 4 to 6 and it unstuck itself - it went from doing a couple of dozen msgs every 10 seconds to doing hundreds of thousands of messages over a similar amount of time. Disk utilization also went significantly up and the “migrates in progress” started moving.

asinfo -v statistics |tr ";" "\n"|grep migrate
migrate_msgs_sent=226640878
migrate_msgs_recv=226642145
migrate_progress_send=339
migrate_progress_recv=1
migrate_num_incoming_accepted=1317
migrate_num_incoming_refused=0
migrate_tx_objs=339
migrate_rx_objs=1

migrate_msgs_sent=226920061
migrate_msgs_recv=226921331
migrate_progress_send=339
migrate_progress_recv=1
migrate_num_incoming_accepted=1317
migrate_num_incoming_refused=0
migrate_tx_objs=339
migrate_rx_objs=1

I’ve uploaded my log here: http://eyeview-tmp.s3.amazonaws.com/naoum/aerospike/migrates_stuck_aerospike.log - let me know what you think.

Thanks!


#4

@lucien,

I got another instance of those … It’s been 2 days and the migrations have not cleared yet. When I bump up some setting it starts migrating for a big but then it slows down again … I tried bumping the threads up to 16 (on a 8 core box) and it helped temporarily but then stopped. I also bumped all the other migration related settings just to make sure other cluster operations aren’t in the way but the cluster is basically idle.

asinfo -v 'get-config' | tr ';' '\n'|grep migr
migrate-threads=8
migrate-xmit-priority=0
migrate-xmit-sleep=100
migrate-read-priority=0
migrate-read-sleep=100
migrate-xmit-hwm=150
migrate-xmit-lwm=100
migrate-max-num-incoming=256
migrate-rx-lifetime-ms=60000

uptime
 06:33:28 up 59 days,  4:05,  4 users,  load average: 1.17, 1.18, 1.18

iostat -x 5

vg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.99    0.00    1.42    1.21    0.21   84.16

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.80    0.00    0.40     0.00     4.80    24.00     0.02   44.00    0.00   44.00  44.00   1.76
xvdb              0.00     0.00  192.80   24.80  3906.60  3174.40    65.08     0.09    0.40    0.32    1.06   0.23   4.96
xvdc              0.00     0.00  180.80   24.80  3810.90  3174.40    67.95     0.08    0.40    0.36    0.68   0.25   5.20 

Any thoughts? Let me know if you want me to upload logs.


#5

Naoum,

You are running ec2 instances, what kind of instances are these? Do you have provisioned IOPS for these instances?

AWS might be throttling down the network traffic.

-samir


#6

@samir

The cluster is currently only doing some writes (8k/sec) and no reads at all which is nothing for this 19 node cluster. I wanted to let the migrations finish before i switch on all the read traffic but it’s taken 2 full days so far and this node is not yet done. Whenever I bump up the migrating threads (from 4 to 8, or from 8 to 16 or from 16 to 24) that seems to get the migrations to start for a few minutes and then they got stuck again until I have to bump it to a higher number. :disappointed:

All the instances are i2.2xlarge instance and the network usage is around 20/20 mbps which is nothing on the aws network. The disks are local SSDs – http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/i2-instances.html#i2-instances-diskperf

Btw I also checked the per-node TPS in the AMC and it seems evenly distributed - it’s not an issue with some hot key or extra traffic going to this one particular node.

Any other thoughts? What can be further done to confirm if this is a bug?

At this point I’m going to do a cold restart of that node.


#7

The migrate_progress_recv stats indicate that nodes are receiving migrations, so we are not stuck from an algorithmic perspective.

  1. The await time on xvda is fairly high.
  2. Are all nodes configured with the same migration settings you show here? You can verify by running: asadm -e "show config like migrate"

You cannot really use the migrate_progress_send stat as a progress bar. During migrations there are events that both increment and decrement this count, so the count remaining the same for an extended duration may have still made progress. Starting 3.7.0 we have created namespace level stats that will indicate the number of planned migrations for the current cycle and the number or remaining migrations for the current cycle. Here is an excerpt describing the stats from git:

AER-3639 Added new ns stats for mig progress
  • New Metrics:
    • migrate_tx_partitions_scheduled: Total number of migrations this node will send during the current migration cycle for this namespace.
    • migrate_tx_partitions_remaining Number of migrations this node not yet sent during the current migration cycle for this namespace.
    • migrate_rx_partitions_scheduled Total number of migrations this node will receive during the current migration cycle for this namespace.
    • migrate_rx_partitions_remaining Number of migrations this node has not yet received during the current migration cycle for this namespace.
  • Logging:
    • migrations remaining: Indicate the number rx/tx remaining and in progress as well as the percent complete.

Also migrate_progress_send to be migtx actively sending Previously this was the number of migrations currently queued to send.


Migrations are stuck for over 1 week