Migration appears to ignore migrate order



Problem Description

In Aerospike 3.7.5 and later, a migrate-order can be configured for each namespace to specify the order in which namespaces are shipped during migrations. When the logs are examined, it can appear that the migrate-order is not being respected. In this example the namespaces are configured as follows:

namespace COUNTRY {
  replication-factor 16
  memory-size 100M
  default-ttl 0 # use 0 to never expire/evict
  migrate-order 1

  storage-engine device {
    cold-start-empty true
    file /opt/aerospike/data/COUNTRY.dat
    filesize 500M
    data-in-memory true # store data in memory in addition to file.
    write-block-size 128K # adjust block size to make it efficient for SSDs.
  }
}

namespace STATE {
  replication-factor 2
  memory-size 8G
  default-ttl 30d # 30 days
  migrate-order 3

  storage-engine device {
    cold-start-empty true
    file /opt/aerospike/data/STATE.dat
    filesize 8G
    write-block-size 128K # adjust block size to make it efficient for SSDs.
  }
}

namespace CITY {
  memory-size 100M
  default-ttl 0 # use 0 to never expire/evict
  migrate-order 2

  storage-engine device {
    file /opt/aerospike/data/CITY.dat
    filesize 500M
    data-in-memory true # store data in memory in addition to file.
    write-block-size 128K # adjust block size to make it efficient for SSDs.
  }
}

When the logs are checked during migration, the following is observed:

Jul 17 2016 07:12:29 GMT: INFO (info): (thr_info.c:4856) {COUNTRY} memory bytes used 424813 (index 84160 : sindex 0 : data 340653) : used pct 0.41
Jul 17 2016 07:12:29 GMT: INFO (info): (thr_info.c:4902) {COUNTRY} migrations - remaining (748 tx, 1052 rx), active (0 tx, 1 rx), 75.78% complete
Jul 17 2016 07:12:29 GMT: INFO (info): (thr_info.c:4854) {STATE} disk bytes used 1505658880 : avail pct 79
Jul 17 2016 07:12:29 GMT: INFO (info): (thr_info.c:4856) {STATE} memory bytes used 1201284267 (index 75282944 : sindex 58402856 : data 1067598467) : used pct 13.98
Jul 17 2016 07:12:29 GMT: INFO (info): (thr_info.c:4902) {STATE} migrations - remaining (1287 tx, 1120 rx), active (3 tx, 1 rx), 48.71% complete
Jul 17 2016 07:12:29 GMT: INFO (info): (thr_info.c:4854) {CITY} disk bytes used 512 : avail pct 99
Jul 17 2016 07:12:29 GMT: INFO (info): (thr_info.c:4856) {CITY} memory bytes used 222 (index 128 : sindex 0 : data 94) : used pct 0.00
Jul 17 2016 07:12:29 GMT: INFO (info): (thr_info.c:4902) {CITY} migrations - remaining (2 tx, 1433 rx), active (0 tx, 0 rx), 80.69% complete

Based on the configured migrate-order, the expectation is that the COUNTRY namespace migrates first, then CITY, then STATE. However, the logs show that CITY (80.69% complete) has made more progress than COUNTRY (75.78% complete), so it appears as though the migrate-order is not being respected.
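To compare progress across namespaces without reading the log by eye, the migration lines can be parsed directly. The following is a minimal sketch, assuming the log line format shown in the extract above; the function name `migration_progress` and the `SAMPLE` variable are illustrative and not part of any Aerospike tool.

```python
import re

# Matches the per-namespace "migrations - remaining ..." line from the log extract.
MIGRATION_LINE = re.compile(
    r"\{(?P<ns>[^}]+)\} migrations - remaining "
    r"\((?P<rem_tx>\d+) tx, (?P<rem_rx>\d+) rx\), "
    r"active \((?P<act_tx>\d+) tx, (?P<act_rx>\d+) rx\), "
    r"(?P<pct>[\d.]+)% complete"
)


def migration_progress(log_text):
    """Return {namespace: {remaining/active counts, pct_complete}} for each match."""
    progress = {}
    for line in log_text.splitlines():
        m = MIGRATION_LINE.search(line)
        if m:
            progress[m.group("ns")] = {
                "remaining_tx": int(m.group("rem_tx")),
                "remaining_rx": int(m.group("rem_rx")),
                "active_tx": int(m.group("act_tx")),
                "active_rx": int(m.group("act_rx")),
                "pct_complete": float(m.group("pct")),
            }
    return progress


# The three migration lines from the log extract in this article.
SAMPLE = """\
Jul 17 2016 07:12:29 GMT: INFO (info): (thr_info.c:4902) {COUNTRY} migrations - remaining (748 tx, 1052 rx), active (0 tx, 1 rx), 75.78% complete
Jul 17 2016 07:12:29 GMT: INFO (info): (thr_info.c:4902) {STATE} migrations - remaining (1287 tx, 1120 rx), active (3 tx, 1 rx), 48.71% complete
Jul 17 2016 07:12:29 GMT: INFO (info): (thr_info.c:4902) {CITY} migrations - remaining (2 tx, 1433 rx), active (0 tx, 0 rx), 80.69% complete
"""

progress = migration_progress(SAMPLE)
# Namespaces ordered from most to least complete.
ranked = sorted(progress, key=lambda ns: progress[ns]["pct_complete"], reverse=True)
print(ranked)  # ['CITY', 'COUNTRY', 'STATE']
```

For the sample above, CITY ranks ahead of COUNTRY, which is the observation that prompts the question in this article.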

Explanation

This is designed behaviour. An extract from the collectinfo output for the cluster is shown below:

====ASCOLLECTINFO====
[INFO] Data collection for ['namespace'] in progress..
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Node   Namespace   Avail%   Evictions      Master     Replica     Repl     Stop         Disk    Disk     HWM          Mem     Mem    HWM      Stop   
         .           .        .           .     Objects     Objects   Factor   Writes         Used   Used%   Disk%         Used   Used%   Mem%   Writes%   
10.0.0.100   country          99           0   484.000     821.000          3   false    688.375 KB       1      50   426.961 KB       1     60        90   
10.0.0.101   country          99           0   406.000     899.000          3   false    688.375 KB       1      50   426.961 KB       1     60        90   
O            country          99           0   415.000     890.000          3   false    688.375 KB       1      50   426.961 KB       1     60        90   
10.0.0.100   city             99           0     1.000       2.000          3   false    768.000 B        1      50   333.000 B        1     60        90   
10.0.0.101   city             99           0     1.000       2.000          3   false    768.000 B        1      50   333.000 B        1     60        90   
O            city             99           0     1.000       2.000          3   false    768.000 B        1      50   333.000 B        1     60        90   
10.0.0.100   state            74           0   688.331 K   707.030 K        2   false      1.663 GB      21      50     1.312 GB      17     85        90   
10.0.0.101   state            75           0   676.718 K   672.316 K        2   false      1.608 GB      21      50     1.263 GB      16     85        90   
O            state            75           0   677.839 K   663.542 K        2   false      1.599 GB      20      50     1.262 GB      16     85        90   
Number of rows: 9

The collectinfo shows that the COUNTRY and CITY namespaces are very small in comparison to the STATE namespace. In particular, the CITY namespace holds only a handful of objects and, at a few hundred bytes, is a tiny fraction of the size of the COUNTRY namespace. The fact that the configured replication factor for the COUNTRY namespace is 16 is also significant.

During migration it is quite possible that not all namespaces will be present in the migrate queue. One of the internal metrics used to measure migration is emigration instances, which records the number of partitions a node wishes to ship out at a given time during a migration. If a node has not queued any emigration instances for a namespace that comes earlier in the migrate-order, a namespace that comes later in the order but does have queued emigration instances will be allowed to migrate.
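The selection rule described above can be illustrated with a small model. This is a sketch of the behaviour as this article describes it, not the server's actual implementation; the `NamespaceState` class, the `next_namespace_to_migrate` function and the queued counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class NamespaceState:
    name: str
    migrate_order: int       # configured migrate-order (lower value ships first)
    queued_emigrations: int  # emigration instances currently queued (hypothetical values)


def next_namespace_to_migrate(namespaces):
    """Pick the lowest migrate-order among namespaces that have work queued.

    A namespace with nothing queued is skipped rather than waited on, which is
    why a later-ordered namespace can make progress first.
    """
    ready = [ns for ns in namespaces if ns.queued_emigrations > 0]
    if not ready:
        return None
    return min(ready, key=lambda ns: ns.migrate_order).name


# COUNTRY is first in the order but has nothing queued yet,
# so CITY (second in the order) is selected instead.
cluster_view = [
    NamespaceState("COUNTRY", migrate_order=1, queued_emigrations=0),
    NamespaceState("STATE", migrate_order=3, queued_emigrations=3),
    NamespaceState("CITY", migrate_order=2, queued_emigrations=2),
]
print(next_namespace_to_migrate(cluster_view))  # CITY
```

Once COUNTRY queues emigration instances, the same rule selects it ahead of CITY and STATE, so the configured order still applies among namespaces that are ready.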

But why would a namespace that comes earlier in the order not have scheduled emigration instances? A node will not schedule emigrations to a prole until all immigrations to that node are completed. So if the master and prole are out of sync, it is possible that a namespace early in the order is not queuing emigration instances because it is waiting for immigrations to complete.

Because the master waits for immigrations to complete before scheduling emigrations, the amount of work on which the master is waiting becomes significant. Normally this would be dictated by the number of objects within the namespace. However, in the example above, replication factor also plays a part. The master needs to wait until immigrations from all 15 proles are completed before outgoing emigrations can be scheduled.

Forcing a strict migrate-order, where namespaces later in the order can only migrate once namespaces earlier in the order have finished, would result in nodes sitting idle while the earlier namespaces wait for immigrations to complete.

In the example given above, a very high replication factor is influencing the amount of work the master is waiting on. However, anything that increases that wait will influence the behaviour of migrate-order. In versions where Rapid Rebalance is available, this could be a wait due to more records being immigrated from a frequently changing namespace.

The behaviour described in this article should be understood when defining a migrate-order. For example, suppose a large in-memory namespace is configured to migrate first but is waiting for immigrations, and the second namespace is stored on rotational disks. The second namespace could migrate very slowly, so that the migrate process bottlenecks and the in-memory namespace cannot migrate for some time after it has become ready. For this reason a nuanced approach should be taken when setting migrate-order.

Solution

This is designed behaviour to maximise efficiency.

Notes

  • Details on migrate order and how it can be used to speed migrations.

http://www.aerospike.com/docs/reference/configuration#migrate-order

Keywords

MIGRATE ORDER NAMESPACE REBALANCE

Timestamp

7/18/16