Inconsistent result if fetching a key when 1 node crashed on 4 node Aerospike cluster (3.9.0)


#1

Hi all I have a four-nodes Aerospike cluster, the version is 3.9.0 community. I set the replication factor 2, and accidentally one node crashed and disconnect to the cluster. The remaining 3 nodes are working well and I suppose this cluster can work and provide service properly. AMC screenshot is posted as following:

But when I fetch a key by java client, I found sometimes I can get the record, sometimes the record return null. Then I use aql tool instead of java client. It is also wired that on one node, mostly I cannot get this record and return “AEROSPIKE_ERR_RECORD_NOT_FOUND”, but occasionally, I can get this record.

But on another node, mostly I can get the record but occasionally it return AEROSPIKE_ERR_RECORD_NOT_FOUND.

Is it an expected behavior ?


#2

No, this isn’t expected, did it continue after migrations?

Could you provide the configuration of this namespace?


#3

I think there is no special configure. But I find the migration cannot complete. What can I do if the migration cannot complete ?

namespace test {
        replication-factor 1
        memory-size 1G
        default-ttl 30d # 30 days, use 0 to never expire/evict.

        storage-engine memory
}

namespace product {
        replication-factor 2
        memory-size 3G
        high-water-memory-pct 90
        high-water-disk-pct 90
        default-ttl 0 # 30 days, use 0 to never expire/evict.
        storage-engine device {
               file /aerospike/data/product/data1.dat
               file /aerospike/data/product/data2.dat
               file /aerospike/data/product/data3.dat
               file /aerospike/data/product/data4.dat
               #file /data/aerospike/product/data5.dat
               filesize 200g
               write-block-size 1M
               data-in-memory false # Store data in memory in addition to file.
               defrag-startup-minimum  10 # server needs at least 10%
        }
}

namespace staging {
        replication-factor 1
        memory-size 128M
        default-ttl 30d # 30 days, use 0 to never expire/evict.
        storage-engine device {
                file /aerospike/data/staging/staging1.dat
                filesize 2G
                write-block-size 1M
                data-in-memory false
        }

#4

Could provide the results of:

asadm -e "show stat like migra"

#5
[root@Aerospike-ci-2 ~]# asadm -e "show stat like migra"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                        :   10.17.1.11:3000   10.17.1.12:3000   10.17.1.26:3000   10.17.1.7:3000
migrate_allowed             :   true              true              true              true
migrate_partitions_remaining:   515               699               505               528
migrate_progress_recv       :   515               699               505               528
migrate_progress_send       :   515               699               505               528

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~product Namespace Statistics~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                           :   10.17.1.11:3000   10.17.1.12:3000   10.17.1.26:3000   10.17.1.7:3000
migrate-order                  :   5                 5                 5                 5
migrate-sleep                  :   1                 1                 1                 1
migrate_record_receives        :   2098066           6407170           1941434           2042125
migrate_record_retransmits     :   45                337               210               124
migrate_records_skipped        :   0                 0                 0                 0
migrate_records_transmitted    :   2685259           5507711           1881326           2413783
migrate_rx_instances           :   0                 0                 0                 0
migrate_rx_partitions_active   :   0                 0                 0                 0
migrate_rx_partitions_initial  :   364               1205              321               351
migrate_rx_partitions_remaining:   0                 0                 0                 0
migrate_tx_instances           :   0                 0                 0                 0
migrate_tx_partitions_active   :   0                 0                 0                 0
migrate_tx_partitions_imbalance:   249               699               329               317
migrate_tx_partitions_initial  :   712               1735              650               692
migrate_tx_partitions_remaining:   249               699               329               271

#7

In aerospike.log I found following warning, is it the root cause ?

Oct 11 2016 05:00:08 GMT: WARNING (migrate): (migrate.c:1002) imbalance: dest refused migrate with ACK_FAIL
Oct 11 2016 05:00:08 GMT: WARNING (partition): (partition.c:1455) {product:537} emigrate done: failed with error and cluster key is current

#8

Yes, those warnings have something to do with it. Make sure your product namespace configurations are the same on all nodes (specifically replication-factor). Also ensure that paxos-single-replica-limit is also configured the same across all nodes.


#9

The replication-factor is the same 2 and paxos-single-replica-limit is not set, so the default value should be 1. Another root cause?


#10

Another possibility is

[AER-5278] - (Cluster) Principal node may illegally re-enter partition_rebalance if subordinate retransmits

Which is addressed in 3.10.0.

This is roughly the command I used to detect that this occurred in our test environment.

grep '' /var/log/aerospike/aerospike.log | grep 'ALLOW MIGRATIONS' | awk '{print $9 " " $10}' | uniq -c | grep -v DISALLOW | uniq -c

The output should look like this:

     81       1 ALLOW MIGRATIONS
      1       2 ALLOW MIGRATIONS
     71       1 ALLOW MIGRATIONS

If you see a line different to “1 ALLOW MIGRATIONS” as here we see “2 ALLOW MIGRATIONS” then this would be the root cause.

This race is expected to be rare since we haven’t yet seen it outside test environment (and only once there).

If you hit this, likely restarting a node will resolve the issue. If you decide to upgrade to 3.10.0, start with principal node, after the principal is upgraded it shouldn’t be possible for this issue to reoccur.


#12

OK, I have modified to $9 and $10. But I didn’t find 2 ALLOW MIGRATIONS

[root@058b2732-8c18-47ef-bac9-a30e0de57404 ~]# grep '' /var/log/aerospike/aerospike.log | grep 'ALLOW MIGRATIONS' | awk '{print $9 " " $10}' | uniq -c | grep -v DISALLOW | uniq -c
     41       1 ALLOW MIGRATIONS

#13

And I didn’t find any repeated ALLOW MIGRATIONS. Therefore I think it is okay. Maybe another issue cause it ?


#14

I’ve no idea now, is there any walked around solution ? for example, can I asbackup all data without no-cluster-change parameter, and delete all original data file, then asrestore from backup data?


#15

Be sure to run on all nodes (particularly the principal) and through archived logs.


#16

Based on the original symptoms in this ticket, we will need to force all partitions to resync, there are 2 methods in 3.9.0:

  1. Perform a rolling restart of the cluster.
  2. I believe 3.9.0 still has dun commands and you can force all partitions to resync by running: (Note: this will cause the cluster to split into single node clusters and then reform.)
    asadm -e "cluster dun all"
    sleep 10
    asadm -e "cluster undun all"
    

#17

If I take method 2, need I run dun command on each node? Or just run on one node ?


#18

The node with star is principal, right?

Admin> info
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           Node               Node                Ip     Build   Cluster            Cluster     Cluster         Principal   Client     Uptime
              .                 Id                 .         .      Size                Key   Integrity                 .    Conns          .
10.17.1.11:3000   BB9B2509D97ADB2    10.17.1.11:3000   C-3.9.0         4   9D6032CC83D13B3B   True        BB9D21EC8B9FBE2      157   22:57:22
10.17.1.12:3000   BB9445B8ACAE692    10.17.1.12:3000   C-3.9.0         4   9D6032CC83D13B3B   True        BB9D21EC8B9FBE2      345   05:21:48
10.17.1.26:3000   BB93D767A9884A2    10.17.1.26:3000   C-3.9.0         4   9D6032CC83D13B3B   True        BB9D21EC8B9FBE2      170   22:57:19
10.17.1.7:3000    *BB9D21EC8B9FBE2   10.17.1.7:3000    C-3.9.0         4   9D6032CC83D13B3B   True        BB9D21EC8B9FBE2      196   22:57:25
Number of rows: 4

#19

just curious, I have two empty namespace, why they also have pending migrates.

Namespace              Node   Avail%   Evictions     Master    Replica     Repl     Stop     Pending         Disk    Disk     HWM          Mem     Mem    HWM      Stop
        .                 .        .           .    Objects    Objects   Factor   Writes    Migrates         Used   Used%   Disk%         Used   Used%   Mem%   Writes%
        .                 .        .           .          .          .        .        .   (tx%,rx%)            .       .       .            .       .      .         .
product     10.17.1.11:3000   85         0.000      5.536 M    3.362 M   2        false    (42,33)     111.189 GB   14      90      665.034 MB   22      90     90
product     10.17.1.12:3000   85         0.000      5.508 M    3.845 M   2        false    (71,26)     111.501 GB   14      90      666.747 MB   22      90     90
product     10.17.1.26:3000   86         0.000      5.197 M    2.969 M   2        false    (22,27)     108.750 GB   14      90      650.457 MB   22      90     90
product     10.17.1.7:3000    85         0.000      5.531 M    2.507 M   2        false    (51,33)     113.008 GB   15      90      676.130 MB   23      90     90
product                                  0.000     21.772 M   12.682 M                     (45,30)     444.449 GB                     2.596 GB
staging     10.17.1.11:3000   99         0.000      0.000      0.000     1        false    (0,42)        0.000 B    0       50        0.000 B    0       60     90
staging     10.17.1.12:3000   99         0.000      0.000      0.000     1        false    (18,0)        0.000 B    0       50        0.000 B    0       60     90
staging     10.17.1.26:3000   99         0.000      0.000      0.000     1        false    (0,39)        0.000 B    0       50        0.000 B    0       60     90
staging     10.17.1.7:3000    99         0.000      0.000      0.000     1        false    (52,44)       0.000 B    0       50        0.000 B    0       60     90
staging                                  0.000      0.000      0.000                       (10,35)       0.000 B                      0.000 B
test        10.17.1.11:3000   99         0.000      0.000      0.000     2        false    (0,18)        0.000 B    0       50        0.000 B    0       60     90
test        10.17.1.12:3000   99         0.000      0.000      0.000     2        false    (79,40)       0.000 B    0       50        0.000 B    0       60     90
test        10.17.1.26:3000   99         0.000      0.000      0.000     2        false    (0,16)        0.000 B    0       50        0.000 B    0       60     90
test        10.17.1.7:3000    99         0.000      0.000      0.000     2        false    (0,17)        0.000 B    0       50        0.000 B    0       60     90
test                                     0.000      0.000      0.000                       (16,25)       0.000 B                      0.000 B
Number of rows: 15

#20

asadm will send it to all nodes.


#21

Yes, the principal is the one witg the star.


#22

The partition versions have diverged, the rebalance layer doesn’t know how many records each node has, and it cannot really use that information even if it did.