Scan/query all records returns only half of the records after a restart of any node

nosferaty · February 14, 2016, 1:19pm

Hi,

We ran into a problem: a scan/query of all records returns only half of the records after any node is restarted.

How to reproduce:

1. Deploy an Aerospike cluster in Google Cloud with 2 nodes as described here: Google Cloud console
2. Create a simple application with the latest Aerospike Java client (in my case “3.1.9”) which connects to the cluster and creates 10 records
3. Every second, the application queries all records from the set (or scans the set) and expects to receive 10 records
4. Manually restart one of the Aerospike nodes. After the restart we see one of two possible results:

a) We receive 15 records instead of 10, but the problem disappears within a second.

b) We receive 5 records instead of 10. The problem repeats every second. It disappears only when we restart the application or recreate the AerospikeClient object.

I understand why case “a” happens after the restart (it was described in one of the posts on this forum). But why does the problem in case “b” not go away until we restart the app or recreate the AerospikeClient object?

UPD: it happens only if the app server running the Java app is located far enough from the Aerospike cluster (in another region). It is not reproduced if the app server is located in the same datacenter as Aerospike.

Thank you

Guy_Sela · February 17, 2016, 11:58am

For a) I guess you didn’t scan with the flag “failScanOnClusterChange”

For b) I have no idea. Does your client connect successfully to the node after the restart?

Brian · February 24, 2016, 8:17pm

A scan does not guarantee consistency when data migrations are in progress. A scan can detect data migrations and fail if “failScanOnClusterChange” is set to true.
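If a scan is configured to fail on a cluster change, the caller has to decide what to do with that failure. One option is to retry the scan a few times. Below is a minimal sketch of that idea in plain Java, not Aerospike's own API: `ScanRetry` and `scanWithRetry` are hypothetical names, and the scan is passed in as a `Callable` so the snippet compiles without the client library. In real code the `Callable` would wrap the actual scan call made with the fail-on-cluster-change flag enabled.

```java
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of the Aerospike client.
class ScanRetry {
    /**
     * Runs the given scan up to maxAttempts times and returns the first
     * result that completes without throwing. The scan is expected to throw
     * when it detects a cluster change (assumption: the real scan is
     * configured to fail instead of returning partial or duplicate records).
     */
    static <T> List<T> scanWithRetry(Callable<List<T>> scan,
                                     int maxAttempts,
                                     long pauseMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return scan.call();
            } catch (Exception e) {
                last = e;
                // Give migrations some time before the next attempt.
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        // Every attempt failed; surface the last failure to the caller.
        throw last;
    }
}
```

The trade-off is that the application sees either a complete scan or an explicit error, rather than a silently wrong record count such as the 15 records in case “a”.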

nosferaty · March 10, 2016, 9:41pm

The client in another datacenter never reconnects after the node restart. This happens not only after a node restart but also occasionally on its own; I believe network issues trigger it. Restarting a node is simply a reliable way to reproduce the problem. We tried to reproduce it with the same app in different datacenters (connected to the same Aerospike cluster):

The first one is located in the same datacenter as the Aerospike cluster (low ping). The Aerospike client reconnects automatically: getNodeNames().size() becomes 1 and then becomes 2 again (when the second Aerospike node becomes available).
The second one is located in another datacenter (higher ping). The Aerospike client does not reconnect automatically: getNodeNames().size() becomes 1 and never returns to 2.

We created a workaround for it: we check client.getNodeNames().size() every 5 seconds. If it changes, we create a new AerospikeClient object. The old one is released after about 1 minute. This helped us avoid the problem in production.
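The workaround above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual production code: `ClientWatchdog` and its three callbacks are hypothetical names, and the client type is kept generic so the snippet compiles without the Aerospike dependency. With the real client, the callbacks would presumably be `() -> new AerospikeClient(host, port)`, `c -> c.getNodeNames().size()`, and something that closes the old client after a delay.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;
import java.util.function.ToIntFunction;

// Hypothetical helper: replaces the client whenever its visible node count changes.
class ClientWatchdog<C> {
    private final Supplier<C> factory;         // creates a fresh client
    private final ToIntFunction<C> nodeCount;  // number of nodes the client currently sees
    private final Consumer<C> closer;          // releases a replaced client
    private final AtomicReference<C> current = new AtomicReference<>();
    private int lastCount;

    ClientWatchdog(Supplier<C> factory, ToIntFunction<C> nodeCount, Consumer<C> closer) {
        this.factory = factory;
        this.nodeCount = nodeCount;
        this.closer = closer;
        C first = factory.get();
        current.set(first);
        lastCount = nodeCount.applyAsInt(first);
    }

    /** The client that application code should use for the next request. */
    C client() {
        return current.get();
    }

    /** Performs one check; returns true if the client was replaced. */
    synchronized boolean checkOnce() {
        C old = current.get();
        if (nodeCount.applyAsInt(old) == lastCount) {
            return false;
        }
        C fresh = factory.get();
        current.set(fresh);
        lastCount = nodeCount.applyAsInt(fresh);
        // The closer decides when to actually release the old client,
        // e.g. after a delay so that in-flight requests can finish.
        closer.accept(old);
        return true;
    }

    /** Runs checkOnce() periodically on a single background thread. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // Swallow the error so one failed check does not cancel the schedule.
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return ses;
    }
}
```

Calling `start(5)` would match the 5-second interval described above. Application code should fetch the client through `client()` on every request rather than caching the reference, otherwise it keeps using the stale object.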
