Migrations stuck on 3.8.4

Hi,

Our migrations have been stuck for a day now. Normally they go well, but we added a new server to the cluster, then added new disks to all NS, and restarted the nodes one by one.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Namespace                   Node   Avail%   Evictions      Master     Replica     Repl     Stop     Pending         Disk    Disk     HWM          Mem     Mem    HWM      Stop
          .                      .        .           .     Objects     Objects   Factor   Writes    Migrates         Used   Used%   Disk%         Used   Used%   Mem%   Writes%
          .                      .        .           .           .           .        .        .   (tx%,rx%)            .       .       .            .       .      .         .
prod                   aero-1:3000   90               0     1.390 M     1.373 M   2        false    (0,0)        48.809 GB   2       60      319.273 MB   1       70     90
prod                   aero-2:3000   90               0     1.436 M     1.101 M   2        false    (45,45)      49.272 GB   2       60      326.411 MB   1       70     90
prod                   aero-3:3000   90               0     1.338 M     1.349 M   2        false    (3,3)        47.833 GB   2       60      317.467 MB   1       70     90
prod                   aero-4:3000   97               0     1.344 M     1.079 M   2        false    (27,27)      48.725 GB   2       60      324.353 MB   1       70     90
prod                                                  0     5.508 M     4.903 M                     (17,17)     194.638 GB                     1.257 GB

aero-1 has 0/0 migrates pending, but the others have different amounts. When we restarted the servers, we also saw the same kind of warning that I’ve seen others report:

Jan 25 2017 17:02:22 GMT: WARNING (migrate): (migrate.c:1007) imbalance: dest refused migrate with ACK_FAIL
Jan 25 2017 17:02:22 GMT: WARNING (partition): (partition.c:1674) {prod:2152} emigrate done: failed with error and cluster key is current

The nodes are otherwise working and are used all the time. What can we do to resolve this issue?

Maybe unrelated, but is this an LDT namespace? Also, can we have more of the logfile here?

The warnings you indicate are not a good sign and only occur if migrations malfunction (bug).

Since the emigrate done message indicates that the cluster key is current, I suspect you hit a known bug addressed in 3.10.0.3: “Principal node may illegally re-enter partition_rebalance if subordinate retransmits PARTITION_SYNC_REQUEST_ACK”. This issue was uncommon, and restarting a node should correct any partition map issues.

Do you mean restarting any node, or the principal node? I forgot to mention in the post that aero-1 is the principal node.

The issue is uncommon, but it is safest not to restart the principal, since that is where this issue can manifest.

I stopped one node and waited until all migrations were done before I started the node. This solved our problem.
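In case it helps anyone scripting the "wait until all migrations are done" step, here is a minimal sketch in Python. It only parses the text of a namespace table like the one pasted at the top of this thread and checks the "Pending Migrates (tx%,rx%)" column. It is not an Aerospike API: the function names `pending_migrates` and `migrations_done` are made up for illustration, and it assumes the table layout matches the output shown above.

```python
import re

# Matches a per-node row: namespace, host:port, then any columns up to the
# "(tx%,rx%)" pending-migrates pair. Header and summary rows have no
# host:port token, so they do not match.
ROW = re.compile(r"^\s*(\S+)\s+(\S+:\d+)\s.*?\((\d+),(\d+)\)")

def pending_migrates(table_text):
    """Return {node: (tx_pct, rx_pct)} for every per-node row in the table."""
    pending = {}
    for line in table_text.splitlines():
        m = ROW.match(line)
        if m:
            pending[m.group(2)] = (int(m.group(3)), int(m.group(4)))
    return pending

def migrations_done(table_text):
    """True only if at least one node row was found and all show (0,0)."""
    pending = pending_migrates(table_text)
    return bool(pending) and all(p == (0, 0) for p in pending.values())

# Two rows copied from the table in the first post.
SAMPLE = """\
prod                   aero-1:3000   90               0     1.390 M     1.373 M   2        false    (0,0)        48.809 GB   2       60      319.273 MB   1       70     90
prod                   aero-2:3000   90               0     1.436 M     1.101 M   2        false    (45,45)      49.272 GB   2       60      326.411 MB   1       70     90
"""

if __name__ == "__main__":
    print(pending_migrates(SAMPLE))
    print(migrations_done(SAMPLE))
```

You would feed it fresh table output in a loop and only start the stopped node once `migrations_done` returns True. Returning False when no rows are found is deliberate, so an empty or unparseable capture is never mistaken for "all clear".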

It seems the problem happens when using restart. All the other nodes start to migrate when the restarted node goes down, and starting one node takes some 10 minutes. In that time not all nodes have finished migrating, and when the restarted node reconnects to the cluster, it restarts migrations in the middle of the old migration.

I think some kind of waiting period before migrations start would solve the restart problem.

Sadly this solution would generate more problems than it solves ;).

There have been other migration bug fixes since 3.8.4. Your symptoms seemed similar to the one I mentioned, but it is possible you are hitting a separate issue.