FAQ - How are records distributed during node loss from a rack aware cluster?

Detail

When a cluster is using the default partition distribution algorithm, prefer-uniform-balance, together with rack-aware mode, what happens when nodes are lost from one of the racks? How is record distribution affected?

Answer

When using prefer-uniform-balance, the key thing to remember is that the uniform distribution refers to master partitions, so that client traffic is spread evenly across the cluster. Rack-aware adds an extra dimension: if the master copy of a record (or partition) sits in one rack, the replica must sit in the other rack (assuming 2 racks with replication-factor 2).

This implies that when nodes are lost from one rack, that rack will hold fewer master partitions after migration. To maintain the rack-aware commitment, more replica partitions will be stored on that rack.

This can be illustrated easily with an example. The cluster below has the following attributes:

  • 8 nodes split evenly across 2 racks (rack 0 and rack 1)
  • replication-factor 2
  • prefer-uniform-balance enabled
  • 100,000 total records (50,000 unique) in the bar namespace

Here is the record distribution on the cluster.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Object Information (2021-05-10 15:01:23 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Namespace|             Node|Rack|  Repl|    Total|~~~~~~~~~~~Objects~~~~~~~~~~~|~~~~~~~~~Tombstones~~~~~~~~|~~~~Pending~~~~
         |                 |  ID|Factor|  Records|  Master|   Prole|Non-Replica| Master|  Prole|Non-Replica|~~~~Migrates~~~
         |                 |    |      |         |        |        |           |       |       |           |     Tx|     Rx
bar      |172.17.0.3:3000  |   0|     2| 12.435 K| 6.263 K| 6.172 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000
bar      |172.17.0.4:3000  |   0|     2| 12.650 K| 6.307 K| 6.343 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000
bar      |172.17.0.5:3000  |   0|     2| 12.448 K| 6.087 K| 6.361 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000
bar      |172.17.0.6:3000  |   0|     2| 12.467 K| 6.279 K| 6.188 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000
bar      |172.17.0.7:3000  |   1|     2| 12.370 K| 6.285 K| 6.085 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000
bar      |172.17.0.8:3000  |   1|     2| 12.606 K| 6.342 K| 6.264 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000
bar      |172.17.0.9:3000  |   1|     2| 12.625 K| 6.349 K| 6.276 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000
bar      |87e9d68b8e4b:3000|   1|     2| 12.399 K| 6.088 K| 6.311 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000
bar      |                 |    |      |100.000 K|50.000 K|50.000 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000

Each node holds around 12,500 records, split evenly between master and replica (prole) records. Now, 2 nodes are removed from rack 1 and the cluster is allowed to reach migration steady state. Here is how the record distribution looks post-migration.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Object Information (2021-05-10 15:04:14 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Namespace|             Node|Rack|  Repl|    Total|~~~~~~~~~~~Objects~~~~~~~~~~~|~~~~~~~~~Tombstones~~~~~~~~|~~~~Pending~~~~
         |                 |  ID|Factor|  Records|  Master|   Prole|Non-Replica| Master|  Prole|Non-Replica|~~~~Migrates~~~
         |                 |    |      |         |        |        |           |       |       |           |     Tx|     Rx
bar      |172.17.0.3:3000  |   0|     2| 12.388 K| 8.246 K| 4.142 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000
bar      |172.17.0.4:3000  |   0|     2| 12.609 K| 8.437 K| 4.172 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000
bar      |172.17.0.5:3000  |   0|     2| 12.519 K| 8.264 K| 4.255 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000
bar      |172.17.0.6:3000  |   0|     2| 12.484 K| 8.284 K| 4.200 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000
bar      |172.17.0.7:3000  |   1|     2| 24.969 K| 8.297 K|16.672 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000
bar      |646d4a248777:3000|   1|     2| 25.031 K| 8.472 K|16.559 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000
bar      |                 |    |      |100.000 K|50.000 K|50.000 K|    0.000  |0.000  |0.000  |    0.000  |0.000  |0.000

The total number of records in the cluster has not changed but the distribution has changed significantly. The first thing to notice is that the number of master records per node is still pretty much even across the cluster. This is because the calculation is as simple as:

master records per node  = total unique records / # of nodes in cluster = 50,000 / 6 ~ 8333

What has changed, in order to keep the rack-aware commitment, is the number of replica records in each rack: the replica records in one rack must equal the master records in the opposing rack. As the racks now have different numbers of nodes, the split between master and replica records in each rack is unbalanced.

  • Replica records in rack 0:

master records per node * # of nodes in rack 1 = 8333 * 2 ~ 16,666

  • Replica records per node in rack 0:

replica records in rack / # of nodes in rack ~ 16666 / 4 ~ 4167

  • Replica records in rack 1:

master records per node * # of nodes in rack 0 = 8333 * 4 ~ 33,333

  • Replica records per node in rack 1:

replica records in rack / # of nodes in rack = 33333 / 2 ~ 16,666

  • Total records in rack 0 ~ (8333*4)+(4167*4) ~ 50,000

  • Total records in rack 1 ~ (8333*2)+(16666*2) ~ 50,000
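
The arithmetic above can be reproduced with a short script. The following is a minimal Python sketch of the proportional reasoning only (it is not how Aerospike itself assigns partitions); the unique record count and per-rack node counts are taken from the example.

# Per-rack record arithmetic for the example above.
# Assumes 2 racks, replication-factor 2, masters spread uniformly across all nodes.
unique_records = 50_000
nodes_in_rack = {0: 4, 1: 2}  # node counts after losing 2 nodes from rack 1
total_nodes = sum(nodes_in_rack.values())

masters_per_node = unique_records / total_nodes  # ~8333

for rack, nodes in nodes_in_rack.items():
    other_rack = 1 - rack
    masters_in_rack = masters_per_node * nodes
    # The replicas held by a rack mirror the masters held by the opposing rack.
    replicas_in_rack = masters_per_node * nodes_in_rack[other_rack]
    print(f"rack {rack}: {masters_in_rack:,.0f} masters, "
          f"{replicas_in_rack:,.0f} replicas "
          f"({replicas_in_rack / nodes:,.0f} replicas per node), "
          f"{masters_in_rack + replicas_in_rack:,.0f} records total")

Running it prints roughly 33,333 masters and 16,667 replicas for rack 0 and the reverse for rack 1, matching the post-migration table to within rounding.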

In summary, when one rack becomes smaller due to node loss, it still holds the same number of overall records as the larger rack; however, it holds proportionally fewer masters and more replicas.

Notes

  • It is possible for the smaller rack to hit stop_writes due to the increased number of records per node. This is true regardless of which partition distribution algorithm is chosen.
  • If space is a concern, migrate-fill-delay can be used to postpone fill migrations, keeping only a single copy of the partitions that have lost a node until the delay expires.
  • In the example above, there are slight deviations between the number of records the theory says should be on each node and the actual counts. This is because the algorithm distributes partitions rather than individual records, and 4096 is not a multiple of 6, so some nodes own one more master partition than others (in this case, 4 nodes own 683 master partitions and 2 own 682). The number of records per partition is also not exactly the same, but is usually within a couple of percent, as records are assigned to partitions by the RIPEMD160 hashing function.
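
The partition split described in the last note is simple integer division; a quick Python sketch for illustration:

# 4096 partitions spread across 6 nodes: some nodes own one extra master partition.
partitions = 4096
nodes = 6
base, extra = divmod(partitions, nodes)  # base = 682, extra = 4
print(f"{extra} nodes own {base + 1} master partitions, {nodes - extra} nodes own {base}")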

Keywords

RACK AWARE DISTRIBUTION NODE LOSS UNIFORM BALANCE

Timestamp

May 2021

© 2021 Copyright Aerospike, Inc. | All rights reserved. Creators of the Aerospike Database.