Partial Data Unavailable Issue While Adding Node to Cluster


#1

Hi

We have an 8-node cluster running, on which we performed the following activity:

  1. Stopped a node X
  2. Waited for migrations to finish
  3. Upgraded node X (changed its instance type) and added it back to the cluster
  4. Immediately after the node was added, the total master object count in the cluster dropped; the count slowly returned to normal over the next 3 hours as data migrated to the new node

The logs on the other nodes clearly show this:

 (ticker.c:337) {userdata} objects: all 206750689 master 98417381 prole 108333308
Mar 22 2017 03:45:42 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 206752373 master 98418272 prole 108334101
Mar 22 2017 03:45:52 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 206753522 master 98419111 prole 108334411
Mar 22 2017 03:46:02 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 206754253 master 98419907 prole 108334346
Mar 22 2017 03:46:12 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 206754039 master 98420179 prole 108333860
Mar 22 2017 03:46:22 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 206751456 master 98417778 prole 108333678
Mar 22 2017 03:46:32 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 206749410 master 98415476 prole 108333934
Mar 22 2017 03:46:42 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 199229603 master 118078813 prole 4
Mar 22 2017 03:46:52 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189665619 master 91072301 prole 0
Mar 22 2017 03:47:02 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189664972 master 91072427 prole 0
Mar 22 2017 03:47:12 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189664914 master 91072786 prole 0
Mar 22 2017 03:47:22 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189666022 master 91073649 prole 0
Mar 22 2017 03:47:32 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189667774 master 91074751 prole 0
Mar 22 2017 03:47:42 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189669361 master 91075825 prole 0
Mar 22 2017 03:47:52 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189670615 master 91076574 prole 0
Mar 22 2017 03:48:02 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189671872 master 91077478 prole 0
Mar 22 2017 03:48:12 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189671989 master 91078042 prole 0
Mar 22 2017 03:48:22 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189670657 master 91076709 prole 0
Mar 22 2017 03:48:32 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189669744 master 91075535 prole 0
Mar 22 2017 03:48:42 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189668646 master 91074210 prole 0
Mar 22 2017 03:48:52 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189667295 master 91072865 prole 0
Mar 22 2017 03:49:02 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189667936 master 91073743 prole 0
Mar 22 2017 03:49:12 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189668915 master 91074644 prole 0
Mar 22 2017 03:49:22 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189669988 master 90905736 prole 0
Mar 22 2017 03:49:32 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189671073 master 90906495 prole 169731

while the logs on node X looked like this:

Mar 22 2017 00:30:57 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 177085113 master 81879306 prole 95205807
Mar 22 2017 00:31:07 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 177085561 master 81879587 prole 95205974
Mar 22 2017 00:31:17 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 177085781 master 81879819 prole 95205962
Mar 22 2017 00:31:27 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 177086168 master 81880098 prole 95206070
Mar 22 2017 00:31:37 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 177086726 master 81880362 prole 95206364
Mar 22 2017 00:31:47 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 177087175 master 81880630 prole 95206545
Mar 22 2017 03:46:48 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 25851 master 6842 prole 0
Mar 22 2017 03:46:58 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 106924 master 30132 prole 0
Mar 22 2017 03:47:08 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 188335 master 54219 prole 0
Mar 22 2017 03:47:18 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 271483 master 78056 prole 0
Mar 22 2017 03:47:28 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 366172 master 102394 prole 0
Mar 22 2017 03:47:38 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 474971 master 126461 prole 0
Mar 22 2017 03:47:48 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 583461 master 150340 prole 0
Mar 22 2017 03:47:58 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 692755 master 173920 prole 0
Mar 22 2017 03:48:08 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 812242 master 208937 prole 0
Mar 22 2017 03:48:18 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 944404 master 427152 prole 168946
Mar 22 2017 03:48:28 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 1079532 master 476160 prole 168946
Mar 22 2017 03:48:38 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 1227763 master 525525 prole 168949
Mar 22 2017 03:48:48 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 1366423 master 551363 prole 168954
Mar 22 2017 03:48:58 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 1527133 master 746939 prole 168955
Mar 22 2017 03:49:08 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 1688367 master 964915 prole 168957
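For reference, the drop is easy to track by scripting over the ticker lines above; a minimal sketch (the helper name and regex are mine, not an Aerospike tool):

```python
import re

# Pull the per-namespace object counts out of ticker lines like the ones above.
TICKER_RE = re.compile(r"\{(\w+)\} objects: all (\d+) master (\d+) prole (\d+)")

def parse_ticker(line):
    """Return (namespace, all, master, prole) or None if the line doesn't match."""
    m = TICKER_RE.search(line)
    if not m:
        return None
    ns, all_, master, prole = m.groups()
    return ns, int(all_), int(master), int(prole)

line = ("Mar 22 2017 03:46:42 GMT: INFO (info): (ticker.c:337) "
        "{userdata} objects: all 199229603 master 118078813 prole 4")
print(parse_ticker(line))
# -> ('userdata', 199229603, 118078813, 4)
```

Running this over both logs makes the cluster-wide dip and the slow recovery easy to plot.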

Clearly this is a sign of data unavailability, which was not expected. Is this expected behavior?

We are using version 3.9.1, with the following config:

service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid

        ## SET TO NUMBER OF CORES ##
        service-threads 8
        transaction-queues 8
        scan-threads 8
        ###########################

        ## DONT CHANGE ##
        transaction-threads-per-queue 3
        proto-fd-idle-ms 600000
        proto-fd-max 100000
        batch-max-requests 10000
        migrate-threads 1
        replication-fire-and-forget true
        ##########################
}

logging {
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode mesh
                port 3002

                mesh-seed-address-port 10.0.23.96 3002
                ...................
                interval 150
                timeout 20
        }

        fabric {
                port 3001
        }

        info {  
                port 3003
        }
}

namespace userdata {
        replication-factor 2
        #### CHANGE FOR INSTANCE ###
        memory-size 60G
        ############################
        default-ttl 0 # use 0 to never expire/evict.
        storage-engine device {
                ## COLD START AND NO SHADOW DEVICE ##
                cold-start-empty true
                device  /dev/nvme0n1
                #####################################
                ### 1MB FOR INSTANCE STORE ###
                write-block-size 1024K
                #############################
                max-write-cache 1024M
        }
        migrate-sleep 100
#       storage-engine memory
}
namespace user_config_data {
        replication-factor 2
        memory-size 1G
        default-ttl 0
        storage-engine device {
                cold-start-empty true
                file /dev/aerospike/user_config_data.dat
                filesize 1G
                write-block-size 1024K
        }
}

#2

Why is this a sign of data unavailability? Was any of your data unavailable from the client side while the added node was coming up? I see that you are using “cold-start-empty true” in your config, so this node’s share of the data has to be migrated onto it from the rest of the cluster. But while migrations are going on, clients will be served from the rest of the cluster, which has the data. Would you kindly elaborate on what you mean by data unavailability? Read more about migrations here: http://www.aerospike.com/docs/operations/manage/migration


#3

I agree that this seems alarming! These counts try not to over-report the number of objects, but may under-report the object count during migrations. Basically, any partition that has an inbound migration will not report objects for that partition.

Hopefully we can make this less alarming in the near future :slight_smile: .
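The counting rule described above can be modeled with a toy sketch (illustrative only, not actual server code):

```python
# Toy model: each partition has an object count and a flag saying whether it
# currently has an inbound migration. The ticker skips partitions with inbound
# migrations, so the reported total dips below the true total during migrations.
partitions = [
    {"objects": 1000, "inbound_migration": False},
    {"objects": 1200, "inbound_migration": True},   # not counted while migrating
    {"objects": 800,  "inbound_migration": False},
]

true_total = sum(p["objects"] for p in partitions)
reported = sum(p["objects"] for p in partitions if not p["inbound_migration"])

print(true_total, reported)  # -> 3000 1800
```

Once migrations complete, every partition's flag clears and the reported count converges back to the true total, which matches the 3-hour recovery seen in the logs.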


#4

Yes, some of the data was unavailable to the client, which was the first point of observation; after that I checked the Aerospike stats and found this. The drop in master object count on the nodes is also visible in the logs.

By data unavailability I mean that the total master object count in the cluster dropped from ~694 million to ~620 million instantaneously, and it took 3 hours for the count to return to ~694 million. During this period we got errors on the client that records did not exist for some keys.

One reason I suspect: I may have added node X while the previous migrations (due to the node removal) had not yet 100% finished. But even in that scenario, is such an outcome expected?


#5

You said you were using version 3.9.1… I have read about some issues with batch reads during migrations in version 3.6.x or earlier. I would think this (record not found) is not expected behavior. @kporter – your thoughts?

It is recommended to make a backup of the data before adding any node, and not to add a node until migrations have completed. Adding/removing nodes in quick succession can, at worst, cause loss of data; hence the recommendation to do a backup. But not being able to read data that exists on the cluster does not make sense to me.


#6

I didn’t mean to say we were unable to read data that existed on the cluster. I meant some data was unavailable on the cluster, because of which some key fetches got a “key not exist” response.

By the look of the stats, it appears the data being migrated to the new node was unavailable, and once migrations finished, the whole data set became available again.


#7

This is odd since the new node was empty - correct?

It wouldn’t be unlikely if the node had been out of the cluster for a while and returned with data. Not-founds are far less likely if the node was added empty, though there are scenarios that may result in them.

If you were using read-consistency-all these shouldn’t happen in either case. Read more about consistency levels here:

By default, a read returns the data the queried server has, even if that server knows other nodes may have a different copy of the requested record; this allows reads to be fast but potentially stale. ‘All’ instructs the read operation to resolve all other possible holders of the record and return the latest copy.

For writes, the default policy is to resolve all versions before applying the write. This means that if a client performing a ‘read-modify-write’ operation received a “not found” and then issued a write with the create-only policy, the write would still fail, since the write’s default consistency level forces it to find the existing record. Similarly, if the write found a record which was stale, it should update it with the gen-equal policy, in which case the write fails if the requested server finds a copy of the record in the cluster with a later generation than the one the write prescribed.
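The write-policy behavior described above can be sketched with a toy in-memory store (illustrative only; the class, method names, and policy strings are mine, not the Aerospike client API):

```python
# Toy key-value store tracking a generation per record, to illustrate
# how create-only and gen-equal write policies reject writes.
class ToyStore:
    def __init__(self):
        self._data = {}  # key -> (value, generation)

    def put(self, key, value, policy=None, gen=None):
        exists = key in self._data
        if policy == "create-only" and exists:
            # The record exists cluster-wide even if a read missed it.
            raise ValueError("record already exists")
        if policy == "gen-equal":
            current_gen = self._data[key][1] if exists else 0
            if gen != current_gen:
                # A later generation exists: refuse the stale write.
                raise ValueError("generation mismatch")
        new_gen = (self._data[key][1] + 1) if exists else 1
        self._data[key] = (value, new_gen)
        return new_gen

store = ToyStore()
store.put("k", "v1")                              # generation becomes 1
store.put("k", "v2", policy="gen-equal", gen=1)   # ok, generation matches
try:
    store.put("k", "v3", policy="gen-equal", gen=1)  # stale: generation is now 2
except ValueError as e:
    print("write failed:", e)  # -> write failed: generation mismatch
```

This mirrors the failure modes in the paragraph: a create-only write fails when the record actually exists, and a gen-equal write fails when a later generation is found.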


#8

A bit of a preview into the future: 3.13.0 is expected to be capable of selecting a better master partition, which will nearly eliminate this hassle. The exceptions are when more than replication-factor nodes drop and later return to the cluster, and/or network partitioning.