Scan/query all records returns only half of the records after a restart of any node


#1

Hi,

We have run into a problem: a scan or query of all records returns only half of the records after any node is restarted.

How to reproduce:

  1. Deploy an Aerospike cluster in Google Cloud with 2 nodes, as described here: https://cloud.google.com/launcher/solution/click-to-deploy-images/aerospike

  2. Create a simple application with the latest Aerospike Java client (“3.1.9” in my case) that connects to the cluster and creates 10 records

  3. Every second, the application queries all records from the set (or scans the set) and expects to receive 10 records

  4. Manually restart one of the Aerospike nodes. After the restart we see one of two possible results:

a) We receive 15 records instead of 10, but the problem disappears within a second.

b) We receive 5 records instead of 10. The problem repeats every second and disappears only when we restart the application or recreate the AerospikeClient object.

I see why case “a” happens after the restart (it was described in one of the posts on this forum). But why does the problem in case “b” persist until we restart the app or recreate the AerospikeClient object?

UPD: it happens only if the app server running the Java app is located far enough from the Aerospike cluster (in another region). It does not reproduce if the app server is in the same datacenter as Aerospike.

Thank you


#2

For a), I guess you didn’t scan with the “failScanOnClusterChange” flag set.

For b), I have no idea. Does your client connect successfully to the node after the restart?


#3

A scan does not guarantee consistency while data migrations are in progress. A scan can detect data migrations and fail if “failScanOnClusterChange” is set to true.


#4

The client in the other datacenter never reconnects after the node restart. It happens not only after a node restart but also sporadically, which I believe is caused by network issues; a node restart just makes the problem easy to reproduce. We tried to reproduce it with the same app in different datacenters (connected to the same Aerospike cluster):

  1. The first one is located in the same datacenter as the Aerospike cluster (low ping). The Aerospike client reconnects automatically: getNodeNames().size() drops to 1 and then returns to 2 (when the second Aerospike node becomes available).
  2. The second one is located in another datacenter (higher ping). The Aerospike client does not reconnect automatically: getNodeNames().size() drops to 1 and never returns to 2.

We created a workaround for it: we check client.getNodeNames().size() every 5 seconds. If it changes, we create a new AerospikeClient object. The old one is released after about 1 minute. This helped us avoid the problem in production.
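
For anyone who wants to try the same approach, here is a rough sketch of the polling logic, not our exact production code. To keep it self-contained it does not reference the Aerospike classes directly: NodeCountWatchdog, its nodeCount supplier and its rebuildClient callback are hypothetical names introduced for illustration. In a real application the supplier would presumably wrap client.getNodeNames().size(), and the callback would create the new AerospikeClient and schedule the old one to be closed later.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

/** Hypothetical helper: triggers a client rebuild when the visible node count changes. */
class NodeCountWatchdog {
    // Assumption: supplied by the caller, e.g. () -> client.getNodeNames().size()
    private final IntSupplier nodeCount;
    // Assumption: creates a new client and retires the old one after a delay
    private final Runnable rebuildClient;
    private int lastSeen;

    NodeCountWatchdog(IntSupplier nodeCount, Runnable rebuildClient) {
        this.nodeCount = nodeCount;
        this.rebuildClient = rebuildClient;
        this.lastSeen = nodeCount.getAsInt();
    }

    /** Runs one poll; returns true if the node count changed and a rebuild was triggered. */
    synchronized boolean check() {
        int current = nodeCount.getAsInt();
        if (current == lastSeen) {
            return false;
        }
        rebuildClient.run();
        // Re-read after the rebuild so the baseline reflects what the new client sees.
        lastSeen = nodeCount.getAsInt();
        return true;
    }

    /** Polls on a single background thread with a fixed delay between checks. */
    static ScheduledExecutorService start(NodeCountWatchdog watchdog, long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(watchdog::check, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Calling NodeCountWatchdog.start(watchdog, 5) would give the 5-second interval mentioned above. Note that the scheduler stops running the task if check() throws, so the rebuild callback should catch its own exceptions.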