Less query result during node startup

index
secondary
query

#1

I run an Aerospike cluster v3.4 with two nodes. The cluster contains only 19 records. Using the Java Client API, I wrote a small application that queries all these data periodically:

Statement statement = new Statement();
statement.setNamespace("testNs");
statement.setSetName("testSet");
        RecordSet rs = client.query(null, statement);
        while (rs.next()) {} ...

This work fine usually, but if I run only one node and start up the other while running the query periodically, I will get a result with only 10 record for about 10 seconds, then the number of records climb to 19 again in about 1-2 seconds.

How can I run these query in a way to be sure I always get the correct results?

Configuration has mesh topology set up and only in-memory storage.


#2

I have also run these query from aql, it returned less records similiarly to the Java application.


#3

It seems that when the node is starting up, the following happens (my own theory):

  1. we see all the data that is in the cluster (for example 100 records on 3 nodes with 2 replicas)
  2. the node is added to the node list of the cluster
  3. so the new node is also queried for the query results
  4. the client thinks that the new node has all data
  5. so it queries those keys (records) from the node, that should be there, but those records are not there yet, because the node is still being synced, so the query returns about 66 records only
  6. The result count is increasing as the records are being synced, 66 increases to 100
  7. finally the node is in sync, so we see all data again

So the node is added to the available nodes of the cluster and thus queried also before all the data is replicated there.

I thing the node should be added to the queried nodes only when all data is available there, until then that node should not be considered as an available node.


#4

Thanks for posting this.

My understanding is that if you were not querying for all records in the set, this should behave correctly as long as you are not doing any inserts at the same time.

When you are query the whole set though, the code path is different and ends up on the scan code path… You can change this behavior for scans (use the scanAll method with a ScanPolity specifying true for ‘failOnClusterChange’) which means that while migrations are happening (rebalancing of the data when a node joins or leaves the cluster) the scan would fail rather then giving you incomplete result.

I will try to confirm further and let you know if this is not accurate.


#5

I have added a secondary index and a filter to the statement. Now it is better, but some if the data is still missing for a short period. When we have about 200.000 records in the DB, 1% of the records were missing while the node was starting up. I did not make any inserts at that time.

How can we prevent to miss the 1% of the records? Or at least to know that a node is starting up and we might see inconsistencies?

What should we expect if we also write to the database while the node is starting up?


#6

The behavior of query with secondary index is defined here

http://www.aerospike.com/docs/architecture/secondary-index.html

Secondary index query DOES_NOT go the node unless it has data. It always happens in the QUERY NODE. But given there can be window of transition when due to async nature of querying it may happen that for certain partition none of the node is QUERY NODE at a particular time, query could possibly return less result in that case. If you had 100millon rows instead of 100 then % of records missing will be way less. Also when writes are going on all the new writes are picked up by the query engine.

Regarding scan, it runs with different underlying engine which looks at the master copy to get the data back. So as you observed when cluster changing when master does not have data it is possible that some data may not come back.

– Raj


#7

Hi,

sorry, I do not understand your explanations, please describe it:

According to the page you referenced, if I have a “Node Joining With No Data, Without Fast Restart”, the “Query Consistency during migrations” is Consistent Copy. As I understand this, I should get back the correct data.

Also it says: “Under normal operating environment, Nodes can be simply added and removed from the cluster with complete query availability.” My understanding is that adding a node is a normal operation environment for such databases that is built for scalability.

I did not find a description for what a query node is and that how a query node is selected or created, how does it affect a database query. Is it documented somewhere?

Thanks for the help,

– Sini


#8

There is brief description of the what the QUERY NODE is just below the table in the docs page. Here is some more detail.

Data in aerospike is stored in 4096 partition and data is stored and moved around in units of partition. Given row/record falls in one such partition based on static hash. In any given particular cluster state a node is designated MASTER or REPLICA. When cluster activity happens (node up/down with data in it while writes are going on), there can be many different versions of same partitions present in cluster. In such case when trying to fetch data for a certain partition while performing secondary index query, there are two options

  • To bring data from all the copies of a partition in cluster and remove duplicate rows and return unique rows only. This can either be done at one of the server nodes or at the client. Both of which is not scalable in nature because it either introduces huge intra-cluster traffic or can flood client with lot of data.

  • Just INDEPENDENTLY decide at each node whether to return result back for a certain partition without communicating with any other node in cluster.

We pick the second option above. A secondary index query is sent to all the nodes in the cluster. And each node decide INDEPENDENTLY on whether it should be returning a value for all the row in a given partition or not. For a given partition at any point there is only one such node in cluster. This node is what is referred to as QUERY NODE.

Under normal running cluster MASTER is also the QUERY NODE. But in presence of multiple copies one of nodes other than MASTER can be picked up as QUERY NODE. A node is picked based on function of number of row in partition and position in the succession list. A node having higher number of rows is picked with weightage MASTER > REPLICA > node outside replica list. When the cluster settles and migration finishes MASTER becomes the QNODE again.

Given the async nature of when a QUERY NODE designation changes from MASTER to non-MASTER and back again, it is possible that a when a query is received for a certain instanct none of the node may be see itself as QUERY NODE for a particular partition hence not return result for it. This can happen for brief moment of when query is performed.

The case you are describing though there is only 1 copy of partition there is still a transition happening wherein a new node which is MASTER for a certain partition becomes QUERY NODE post migration. If query hits server when this transition is happening it may not see all the data. With small number of rows many such transition can happen in rapid succession and show up much more pronounced. But when there are large number records such transition are spaced out in timescale.

HTH

– Raj