Partial Data Unavailable Issue While Adding Node to Cluster

Hi

We have an 8-node cluster running, in which we did the following:

  1. Stopped a node X.
  2. Waited for migrations to finish (a way to verify this is sketched just after this list).
  3. Upgraded node X (changed its instance type) and added it back to the cluster.
  4. Immediately after the node was added, the total master objects in the cluster dropped, and as the data migrated to the new node the total count slowly returned to normal after about 3 hours.
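
For reference, here is a minimal sketch (Python, Aerospike Python client of that era) of how migration completion might be verified on a node before making another change. The seed address is the one from the mesh config below; the migrate_tx_partitions_remaining / migrate_rx_partitions_remaining names are the 3.9+ namespace statistics, and the exact layout of the info response parsed here is an assumption:

    import time
    import aerospike

    # Connect via a seed node (address taken from the mesh config below).
    client = aerospike.client({'hosts': [('10.0.23.96', 3000)]}).connect()

    def migrations_done(host, ns='userdata'):
        # info_node typically returns "request\tkey1=value1;key2=value2;...\n"
        raw = client.info_node('namespace/' + ns, host)
        stats = dict(kv.split('=', 1)
                     for kv in raw.split('\t')[-1].strip().split(';') if '=' in kv)
        return (int(stats.get('migrate_tx_partitions_remaining', 0)) == 0 and
                int(stats.get('migrate_rx_partitions_remaining', 0)) == 0)

    # Poll one node until it reports no pending inbound/outbound migrations.
    while not migrations_done(('10.0.23.96', 3000)):
        time.sleep(30)
    print('migrations complete on this node')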

The logs on another node clearly show this:

 (ticker.c:337) {userdata} objects: all 206750689 master 98417381 prole 108333308
Mar 22 2017 03:45:42 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 206752373 master 98418272 prole 108334101
Mar 22 2017 03:45:52 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 206753522 master 98419111 prole 108334411
Mar 22 2017 03:46:02 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 206754253 master 98419907 prole 108334346
Mar 22 2017 03:46:12 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 206754039 master 98420179 prole 108333860
Mar 22 2017 03:46:22 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 206751456 master 98417778 prole 108333678
Mar 22 2017 03:46:32 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 206749410 master 98415476 prole 108333934
Mar 22 2017 03:46:42 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 199229603 master 118078813 prole 4
Mar 22 2017 03:46:52 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189665619 master 91072301 prole 0
Mar 22 2017 03:47:02 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189664972 master 91072427 prole 0
Mar 22 2017 03:47:12 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189664914 master 91072786 prole 0
Mar 22 2017 03:47:22 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189666022 master 91073649 prole 0
Mar 22 2017 03:47:32 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189667774 master 91074751 prole 0
Mar 22 2017 03:47:42 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189669361 master 91075825 prole 0
Mar 22 2017 03:47:52 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189670615 master 91076574 prole 0
Mar 22 2017 03:48:02 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189671872 master 91077478 prole 0
Mar 22 2017 03:48:12 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189671989 master 91078042 prole 0
Mar 22 2017 03:48:22 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189670657 master 91076709 prole 0
Mar 22 2017 03:48:32 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189669744 master 91075535 prole 0
Mar 22 2017 03:48:42 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189668646 master 91074210 prole 0
Mar 22 2017 03:48:52 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189667295 master 91072865 prole 0
Mar 22 2017 03:49:02 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189667936 master 91073743 prole 0
Mar 22 2017 03:49:12 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189668915 master 91074644 prole 0
Mar 22 2017 03:49:22 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189669988 master 90905736 prole 0
Mar 22 2017 03:49:32 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 189671073 master 90906495 prole 169731

While the logs on node X looked like this:

Mar 22 2017 00:30:57 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 177085113 master 81879306 prole 95205807
Mar 22 2017 00:31:07 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 177085561 master 81879587 prole 95205974
Mar 22 2017 00:31:17 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 177085781 master 81879819 prole 95205962
Mar 22 2017 00:31:27 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 177086168 master 81880098 prole 95206070
Mar 22 2017 00:31:37 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 177086726 master 81880362 prole 95206364
Mar 22 2017 00:31:47 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 177087175 master 81880630 prole 95206545
Mar 22 2017 03:46:48 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 25851 master 6842 prole 0
Mar 22 2017 03:46:58 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 106924 master 30132 prole 0
Mar 22 2017 03:47:08 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 188335 master 54219 prole 0
Mar 22 2017 03:47:18 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 271483 master 78056 prole 0
Mar 22 2017 03:47:28 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 366172 master 102394 prole 0
Mar 22 2017 03:47:38 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 474971 master 126461 prole 0
Mar 22 2017 03:47:48 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 583461 master 150340 prole 0
Mar 22 2017 03:47:58 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 692755 master 173920 prole 0
Mar 22 2017 03:48:08 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 812242 master 208937 prole 0
Mar 22 2017 03:48:18 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 944404 master 427152 prole 168946
Mar 22 2017 03:48:28 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 1079532 master 476160 prole 168946
Mar 22 2017 03:48:38 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 1227763 master 525525 prole 168949
Mar 22 2017 03:48:48 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 1366423 master 551363 prole 168954
Mar 22 2017 03:48:58 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 1527133 master 746939 prole 168955
Mar 22 2017 03:49:08 GMT: INFO (info): (ticker.c:337) {userdata} objects: all 1688367 master 964915 prole 168957

Clearly this is a sign of data unavailability, which was not expected. Is this expected behavior?

We are using version 3.9.1, with the following config:

service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid

        ## SET TO NUMBER OF CORES ##
        service-threads 8
        transaction-queues 8
        scan-threads 8
        ###########################

        ## DONT CHANGE ##
        transaction-threads-per-queue 3
        proto-fd-idle-ms 600000
        proto-fd-max 100000
        batch-max-requests 10000
        migrate-threads 1
        replication-fire-and-forget true
        ##########################
}

logging {
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode mesh
                port 3002

                mesh-seed-address-port 10.0.23.96 3002
                ...................
                interval 150
                timeout 20
        }

        fabric {
                port 3001
        }

        info {  
                port 3003
        }
}

namespace userdata {
        replication-factor 2
        #### CHANGE FOR INSTANCE ###
        memory-size 60G
        ############################
        default-ttl 0 # use 0 to never expire/evict.
        storage-engine device {
                ## COLD START AND NO SHADOW DEVICE ##
                cold-start-empty true
                device  /dev/nvme0n1
                #####################################
                ### 1MB FOR INSTANCE STORE ###
                write-block-size 1024K
                #############################
                max-write-cache 1024M
        }
        migrate-sleep 100
#       storage-engine memory
}
namespace user_config_data {
        replication-factor 2
        memory-size 1G
        default-ttl 0
        storage-engine device {
                cold-start-empty true
                file /dev/aerospike/user_config_data.dat
                filesize 1G
                write-block-size 1024K
        }
}

Why is this a sign of data unavailability? Was any of your data unavailable from the client side while the node that was added was coming up? I see that you are using “cold-start-empty true” in your config, so this node’s share of the data has to be migrated onto it from the rest of the cluster. But while migrations are going on, the client will be served from the rest of the cluster, which has the data. Would you kindly elaborate on what you mean by data unavailability? Read more about migrations here: http://www.aerospike.com/docs/operations/manage/migration

I agree that this seems alarming! These counts try not to over-report the number of objects, but they may under-report the object count during migrations. Basically, any partition that has an inbound migration will not report objects for that partition.
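
If it helps to see the dip over time, here is a minimal sketch (plain Python, no client required) that pulls the per-namespace object counts out of ticker lines like the ones quoted above; the log path is the one from the config in the first post:

    import re

    # Matches ticker lines such as:
    # Mar 22 2017 03:46:42 GMT: ... {userdata} objects: all N master N prole N
    TICKER = re.compile(
        r'(?P<ts>\w+ \d+ \d+ [\d:]+ GMT).*\{(?P<ns>\S+)\} objects: '
        r'all (?P<all>\d+) master (?P<master>\d+) prole (?P<prole>\d+)')

    def ticker_counts(path, ns='userdata'):
        with open(path) as log:
            for line in log:
                m = TICKER.search(line)
                if m and m.group('ns') == ns:
                    yield m.group('ts'), int(m.group('master')), int(m.group('prole'))

    for ts, master, prole in ticker_counts('/var/log/aerospike/aerospike.log'):
        print(ts, master, prole)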

Hopefully we can make this less alarming in the near future. :slight_smile:


Yes, some of the data was unavailable to the client, which was the first point of observation; after that I checked the Aerospike stats and found this. The drop in the master object count on the nodes is also visible in the logs.

By data unavailability I mean that the total master object count in the cluster dropped from ~694 million to ~620 million instantaneously, and it took 3 hours for this count to reach ~694 million again. During this period we got errors on the client that the record does not exist for some keys.

One reason I suspect is that I added node X when the previous migrations (due to the node removal) might not have 100% finished. But even in this scenario, is such an outcome expected?

You said you were using version 3.9.1… I have read about some issues with batch reads during migrations in 3.6.x or earlier. I would think this (record not found) is not expected behavior. @kporter – your thoughts?

It is recommended to make a backup of the data before adding any node, and not to add a node until migrations have completed. Adding/removing nodes in quick succession can at worst cause loss of data, hence the recommendation to do a backup. But not being able to read data that exists on the cluster does not make sense to me.

I didn’t mean to say that we were not able to read data that existed on the cluster. I meant that some data was unavailable on the cluster, because of which some keys being fetched got a “key not exist” response.

From the look of the stats, it seems the data that was being migrated to the new node was unavailable, and as migration finished the whole data set became available again.

This is odd since the new node was empty - correct?

It wouldn’t be unlikely if the node had been out of the cluster for a while and returned with data. It is far less likely to receive not-founds if the node was added empty, though there are scenarios that may result in them.

If you were using read-consistency-all these shouldn’t happen in either case. Read more about consistency levels here:

By default, a read returns the data the requested server has, even if that server knows other nodes may have a different copy of the requested record; this allows reads to be fast but potentially stale. ‘All’ instructs the read operation to resolve all other possible holders of the record and return the latest copy.

For writes, the default policy is to resolve all versions before applying the write. This means that if the client were performing a ‘read-modify-write’ operation, received a “not found”, and then issued a write using the create-only policy, the write would fail, since the write’s default consistency level forces it to find the record. Similarly, if the write were to find a record that was stale, it should update the record with the gen-equal policy, where the write fails if the requested server finds a copy of the record elsewhere in the cluster with a later generation than the one the write prescribed.
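
As a rough illustration of those policies with the Python client of that era (the namespace/set/key names here are hypothetical, and the ‘consistency_level’ policy key belongs to clients of this generation; it was replaced in much later releases):

    import aerospike
    from aerospike import exception as ex

    client = aerospike.client({'hosts': [('10.0.23.96', 3000)]}).connect()
    key = ('userdata', 'users', 'user-123')  # hypothetical set and key

    # Read with consistency level ALL: resolve every holder of the record's
    # partition instead of trusting only the node the client happens to ask.
    read_all = {'consistency_level': aerospike.POLICY_CONSISTENCY_ALL}

    try:
        # Create-only write: fails if the record already exists in the cluster.
        client.put(key, {'visits': 1},
                   policy={'exists': aerospike.POLICY_EXISTS_CREATE})
    except ex.RecordExistsError:
        # Read-modify-write with gen-equal: the update applies only if the
        # generation we read is still the latest copy the cluster knows about.
        _, meta, bins = client.get(key, read_all)
        client.put(key, {'visits': bins.get('visits', 0) + 1},
                   meta={'gen': meta['gen']},
                   policy={'gen': aerospike.POLICY_GEN_EQ})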

A bit of a preview into the future: 3.13.0 is expected to be capable of selecting a better master partition, which will nearly eliminate this hassle. The exceptions are when more than replication-factor nodes drop and later return to the cluster, and/or network partitioning.

I know that this is a rather old thread, but the problem is still there in 3.14.1.3. I wonder if this is something that clearly got resolved later, making it worth the hassle of an upgrade, or not.

To be more specific: when adding one or more new nodes, some records become “invisible” to the client until they migrate to their final destination. Those records do not show up when scanning sets, they do not show up in secondary index lookups, and they are not available for reads by their primary key. Updates by primary key also fail.

Eventually their partitions migrate and everything is fine. But if you double the cluster size, half of all records in the database will become unavailable/invisible at first, and then availability gradually improves over time.
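
One way to quantify this symptom, sketched with the Python client (the namespace, set, and key pattern are hypothetical), is to probe a sample of known keys by primary key and count the not-found responses:

    import aerospike
    from aerospike import exception as ex

    client = aerospike.client({'hosts': [('10.0.23.96', 3000)]}).connect()

    # Hypothetical sample of keys that are known to exist in the cluster.
    sample_keys = [('userdata', 'users', 'user-%d' % i) for i in range(10000)]

    missing = 0
    for key in sample_keys:
        try:
            client.get(key)
        except ex.RecordNotFound:
            missing += 1
    print('%d of %d sampled keys returned not-found' % (missing, len(sample_keys)))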

This behavior isn’t expected. If fewer than replication-factor nodes are missing from the cluster, the worst-case behavior during migrations is the possibility of additional latency.

Could you share exactly how you are observing these failures?

During migrations, scans and secondary index query results may have missing or duplicate records.

This thread never confirmed a problem. The master/prole stats dipping during migrations is expected and doesn’t indicate unavailability. We have since introduced a non_replica_objects metric which reports the number of objects which are not currently acting as master or prole.
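
For anyone checking this on a newer server, a sketch of reading that metric from one node’s namespace statistics (the exact layout of the info response parsed here is an assumption, and non_replica_objects is only present on versions that expose it):

    import aerospike

    client = aerospike.client({'hosts': [('10.0.23.96', 3000)]}).connect()
    raw = client.info_node('namespace/userdata', ('10.0.23.96', 3000))
    stats = dict(kv.split('=', 1)
                 for kv in raw.split('\t')[-1].strip().split(';') if '=' in kv)
    print('master:', stats.get('master_objects'),
          'prole:', stats.get('prole_objects'),
          'non-replica:', stats.get('non_replica_objects'))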


Sorry, I saw the reply and had no time to respond. We had to figure out a solution rather fast, and I found a workaround that somehow worked.

First, the scenario where the issue could be seen: add a brand new node to the cluster (same version as the rest, 3.14.1.3).

Now, the scenario in which it all went smoothly: shut down the service on an existing node, wipe the disk to avoid all the deleted zombie data coming back, and start the service again.

We had to restart the services to change the IPs that they were advertising, so it was all about making config changes.

Replication factor is 2, and we initially were only adding servers and not removing them. Data is stored only on disk.

When we faced the issue with migrations, here is what I observed:

  1. part of the data (1/5th if adding a 5th node to a 4-node cluster) became invisible to scans and to reads by primary or secondary keys.
  2. once a record was migrated to its final destination it would become visible again, so records were slowly coming back online.
  3. the brand new server reported massive outgoing migrations, as if the records had to abandon their secondary copies, migrate the master copy to the new box, and then send a copy back out to its new location, with the record invisible until the process was over.

Keep in mind that we are using version 3.14.1.3, so the issue might have been fixed in the later releases, though I have not seen anything about it in the release notes.

Btw, it would be nice to have a setting in config that would allow us to ignore particular namespaces from other nodes in the cluster. This would allow us to migrate to 4.x and drop the unused test namespaces in one cycle instead of doing one cycle of restarts just to drop the namespaces. But that’s not related to the main topic of this thread.

This has existed since 3.14 (3.13 after the paxos-protocol jump). You can add/remove namespaces via a rolling upgrade.

During migrations, scans run with the policy failOnClusterChange = true are expected to fail. If this policy is false, the scan should proceed, but it would then be possible to miss or duplicate results for the partitions which are migrating. Secondary index queries behave as scans with this policy off.
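
A sketch of the two scan behaviors with the Python client of that era (the set name is hypothetical, and the ‘fail_on_cluster_change’ scan policy key is an assumption about clients that predate partition scans):

    import aerospike

    client = aerospike.client({'hosts': [('10.0.23.96', 3000)]}).connect()
    scan = client.scan('userdata', 'users')  # hypothetical set name

    count = [0]
    def tally(record):          # record is a (key, meta, bins) tuple
        count[0] += 1

    # With fail_on_cluster_change=True the scan aborts if migrations start;
    # with False it proceeds but may miss or duplicate records in migrating
    # partitions, matching the behavior described above.
    scan.foreach(tally, {'fail_on_cluster_change': True})
    print(count[0], 'records scanned')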

Primary key reads should work fine during this time. There is an older batch API which could fail on these records (since it didn’t proxy), but it was replaced a very long time ago, so it is unlikely you are running a client which defaults to this API or that you have opted into it (the latest servers and clients have removed this option).

That is odd, a few possibilities:

  1. this node came in with data, or
  2. it somehow couldn’t see the other nodes and formed a single-node cluster before joining, or
  3. there were previously uncompleted migrations.