Hi Aerospike community -
I’m running into a perplexing issue: when querying for all records in a set (i.e. no filters, etc.), I consistently get back roughly half the total number of records in the set. The issue is only reproducible when querying a multi-node cluster, and it doesn’t matter whether the cluster is on bare metal or containerized and deployed on Kubernetes. A single-node cluster, again regardless of where it’s deployed, with the exact same Java client/code does not exhibit this issue, and all the records are returned. I’ve confirmed the “total number of records” both by knowing how many records the data-loading script saves to the set and by checking the count under the objects column after issuing an aql show sets command. I’ve tried loading different types and amounts of data into various sets, but the problem exists across all of them.
Aerospike server version in use is 4.6.0.15, and I’m using Java client version 4.4.10, although I’ve tried various other versions, as well as a Python client v4.0.0 and still got the issue.
I’m using the query(QueryPolicy policy, Statement statement) method of the Aero client, passing in a default QueryPolicy instance and setting only the namespace and setName on the Statement object (no filters, predicates, etc.), which, if I understand correctly, should be equivalent to using scanAll(). I have also tried using scanAll() with a default ScanPolicy, passing in just the namespace and setName (no bins), and the same behavior is exhibited.
Here is the relevant client code:
try {
    // Placeholders: substitute the real namespace and set name.
    String namespace = "<namespace>";
    String setName = "<setName>";

    // No filters or predicates: the statement targets the whole set.
    Statement stmt = new Statement();
    stmt.setNamespace(namespace);
    stmt.setSetName(setName);

    QueryPolicy queryPolicy = new QueryPolicy();
    try (RecordSet recordSet = client.query(queryPolicy, stmt)) {
        int setSize = 0;
        while (recordSet.next()) {
            Key key = recordSet.getKey();
            Record record = recordSet.getRecord();
            System.out.println("Key: " + key + ", Record: " + record);
            setSize++;
        }
        System.out.println("Set Size: " + setSize);
    }
} catch (AerospikeException e) {
    System.err.println("Error: " + e.getMessage());
} finally {
    client.close();
}
Using the above (or directly using scanAll with a callback), for a set name containing 654 records for example, only 317 are being returned.
All of the cluster configs, whether bare metal or on Kubernetes, use the mesh mode for heartbeats in the network stanza, and a replication factor of 2. We have 2-node and 5-node clusters, and in each case the same behavior is exhibited. I can upload the exact configs if needed, but they truly are very “standard” from what I’ve seen.
I have scoured the web and come across hardly any similar issues; the closest one I’ve found is Total Number of records from aql & fetched through query do not match. However, I don’t really understand what the resolution was for the OP; it appears that they just started seeing matching record counts for the various ways of querying. Also, there are some comments in there that say that the number of records under n_objects is the number of master AND replica records, meaning that if there are 3 mil records and a replication factor of 2, then n_objects would report 6 mil… however that’s not what I’ve observed - after loading for example 654 records into a particular set in a 2-node cluster with replication factor 2, show sets reports
| disable-eviction | ns | set-enable-xdr | objects | stop-writes-count | set | memory_data_bytes | truncate_lut | tombstones |
+------------------+-----------+----------------+---------+-------------------+--------------------+-------------------+--------------+------------+
| "false" | "default" | "use-default" | "654" | "0" | <setName> | "331593" | "0" | "0" |
for both nodes, and our AMC dashboard reports twice that number (1308) under the Objects column for that set, but the client query returns 317. Another example I’ve seen with our 5-node cluster (replication factor 2) is that the objects column reports roughly the same number of objects on each node (i.e. 17, 18, 20, 17, 18) and the AMC dashboard reports 90, which is the sum of the aforementioned counts, but the above query (or directly using scanAll()) returns 45…
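To make the arithmetic behind these numbers explicit, here is a rough sketch of what the counts would imply, assuming (and this is only an assumption, based on the comments in the linked thread) that each node's objects value includes both master and replica copies held on that node. The class and method names are hypothetical and only for illustration; no Aerospike API is involved.

```java
import java.util.Arrays;

class ReplicaCountCheck {

    /**
     * Estimates the number of unique records in a set from per-node object counts.
     * ASSUMPTION: every per-node count includes master and replica copies, so each
     * record is counted replicationFactor times across the cluster.
     */
    public static long expectedUniqueRecords(long[] perNodeObjects, int replicationFactor) {
        if (replicationFactor <= 0) {
            throw new IllegalArgumentException("replicationFactor must be positive");
        }
        long clusterTotal = Arrays.stream(perNodeObjects).sum();
        return clusterTotal / replicationFactor;
    }

    public static void main(String[] args) {
        // 2-node cluster: each node reports 654 objects (AMC shows the sum, 1308).
        long twoNode = expectedUniqueRecords(new long[] {654, 654}, 2);
        // 5-node cluster: per-node counts from show sets (AMC shows the sum, 90).
        long fiveNode = expectedUniqueRecords(new long[] {17, 18, 20, 17, 18}, 2);

        System.out.println("2-node expected unique records: " + twoNode + " (query returned 317)");
        System.out.println("5-node expected unique records: " + fiveNode + " (query returned 45)");
    }
}
```

If that assumption holds, the 5-node result of 45 would actually be consistent with the per-node counts, while the 2-node result of 317 would still be roughly half of the expected 654.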
So anyways, any help shedding some light on this issue would be very much appreciated!
Update: the day after reporting, the client is now returning 337 records (previously 317). The sum of those two counts equals the actual total record count in the set, 654, which makes it seem like only one node’s worth of data is being returned at a time.