Migrations stuck on 3.8.4


#1

Hi,

Our migrations have been stuck for a day now. Normally they complete without issue, but we added a new server to the cluster, then added new disks to all namespaces, and restarted the nodes one by one.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Namespace                   Node   Avail%   Evictions      Master     Replica     Repl     Stop     Pending         Disk    Disk     HWM          Mem     Mem    HWM      Stop
          .                      .        .           .     Objects     Objects   Factor   Writes    Migrates         Used   Used%   Disk%         Used   Used%   Mem%   Writes%
          .                      .        .           .           .           .        .        .   (tx%,rx%)            .       .       .            .       .      .         .
prod                   aero-1:3000   90               0     1.390 M     1.373 M   2        false    (0,0)        48.809 GB   2       60      319.273 MB   1       70     90
prod                   aero-2:3000   90               0     1.436 M     1.101 M   2        false    (45,45)      49.272 GB   2       60      326.411 MB   1       70     90
prod                   aero-3:3000   90               0     1.338 M     1.349 M   2        false    (3,3)        47.833 GB   2       60      317.467 MB   1       70     90
prod                   aero-4:3000   97               0     1.344 M     1.079 M   2        false    (27,27)      48.725 GB   2       60      324.353 MB   1       70     90
prod                                                  0     5.508 M     4.903 M                     (17,17)     194.638 GB                     1.257 GB

aero-1 has 0/0 pending migrates, but the other nodes show varying amounts. When restarting the servers, we also saw the same kind of warning that I’ve seen others report:

Jan 25 2017 17:02:22 GMT: WARNING (migrate): (migrate.c:1007) imbalance: dest refused migrate with ACK_FAIL
Jan 25 2017 17:02:22 GMT: WARNING (partition): (partition.c:1674) {prod:2152} emigrate done: failed with error and cluster key is current

The nodes are otherwise working and are used all the time. What can we do to resolve this issue?


#2

Maybe unrelated, but is this an LDT namespace? Also, can you share more of the log file here?


#3

The warnings you posted are not a good sign; they only occur when migrations malfunction (i.e. a bug).

Since the emigrate done message indicates that the cluster key is current, I suspect you hit a known bug addressed in 3.10.0.3: “Principal node may illegally re-enter partition_rebalance if subordinate retransmits PARTITION_SYNC_REQUEST_ACK”. This issue is uncommon, and restarting a node should correct any partition map issues.


#4

Do you mean restarting any node, or the principal node? I forgot to mention in my post that aero-1 is the principal node.


#5

The issue is uncommon, but it is safest not to restart the principal, since that is the node where this issue can manifest.


#6

I stopped one node, waited until all migrations were done, and then started the node again. This solved our problem.

It seems the problem happens when using a plain restart. All the other nodes start migrating when the restarted node goes down, and starting one node takes some 10 minutes. In that time the other nodes have not finished migrating, so when the restarted node rejoins the cluster, it kicks off new migrations in the middle of the old ones.
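For anyone following the same stop/wait/start procedure, here is a minimal sketch of the "wait until migrations are done" step. It assumes the 3.x-era statistics fields `migrate_progress_send` and `migrate_progress_recv` reported by `asinfo -v statistics`; if your build exposes different counter names, adjust accordingly:

```python
import subprocess
import time

def parse_stats(info_blob):
    """Parse the semicolon-separated 'asinfo -v statistics' blob into a dict."""
    return dict(pair.split("=", 1)
                for pair in info_blob.strip().split(";") if "=" in pair)

def migrations_done(info_blob):
    """True when the node reports zero outbound and inbound migrate partitions."""
    stats = parse_stats(info_blob)
    # Field names assumed for 3.x builds; later versions use different counters.
    return (int(stats.get("migrate_progress_send", 0)) == 0
            and int(stats.get("migrate_progress_recv", 0)) == 0)

def wait_for_migrations(host="127.0.0.1", port=3000, poll_s=30):
    """Poll a node until it reports no pending migrations, then return."""
    while True:
        out = subprocess.run(
            ["asinfo", "-h", host, "-p", str(port), "-v", "statistics"],
            capture_output=True, text=True, check=True).stdout
        if migrations_done(out):
            return
        time.sleep(poll_s)
```

You would run `wait_for_migrations()` against each remaining node before starting the stopped node back up, so the cluster is quiescent when it rejoins.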

Some kind of waiting period before migrations start would, I think, solve the restart problem.


#7

Sadly this solution would generate more problems than it solves ;).

There have been other bug fixes to migration since 3.8.4. Your symptoms seemed similar to the issue I mentioned, but it is possible you are hitting a separate one.