Aerospike Java client Query returns only roughly half the total number of records in each set

Hi Aerospike community -

I’m running into a perplexing issue: when I query for all records in a set (i.e. no filters, etc.), I consistently get back roughly half the total number of records in the set. The issue is only reproducible against a multi-node cluster, and it doesn’t matter whether the cluster is on bare metal or containerized and deployed on Kubernetes. A single-node cluster, again regardless of where it’s deployed, with the exact same Java client and code does not exhibit the issue; all the records are returned. I’ve confirmed the “total number of records” both from the number of records written by the script that loads data into the set and from the count under the objects column of an aql show sets command. I’ve tried loading different types and amounts of data into various sets, but the problem exists across all of them.

Aerospike server version in use is 4.6.0.15, and I’m using Java client version 4.4.10, although I’ve tried various other versions, as well as a Python client v4.0.0 and still got the issue.

I’m using the query(QueryPolicy policy, Statement statement) method of the Aerospike client, passing in a default QueryPolicy instance and setting only the namespace and setName on the Statement object (no filters, predicates, etc.), which, if I understand correctly, should be equivalent to using scanAll(). I have also tried scanAll() with a default ScanPolicy, passing in just the namespace and setName (no bins), and it exhibits the same behavior.

Here is the relevant client code:

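import com.aerospike.client.AerospikeException;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.QueryPolicy;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

// client is an already-connected AerospikeClient.
try {
    String namespace = "<namespace>";
    String setName = "<setName>";

    // No filters or predicates: the statement selects every record in the set.
    Statement stmt = new Statement();
    stmt.setNamespace(namespace);
    stmt.setSetName(setName);

    QueryPolicy queryPolicy = new QueryPolicy();
    try (RecordSet recordSet = client.query(queryPolicy, stmt)) {
        int setSize = 0;
        // Count every record the query streams back.
        while (recordSet.next()) {
            Key key = recordSet.getKey();
            Record record = recordSet.getRecord();
            System.out.println("Key: " + key + ", Record: " + record);
            setSize++;
        }
        System.out.println("Set Size: " + setSize);
    }
} catch (AerospikeException e) {
    System.err.println("Error: " + e.getMessage());
} finally {
    client.close();
}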

Using the above (or directly using scanAll with a callback) on a set containing 654 records, for example, only 317 are returned.

All of the cluster configs, whether bare metal or on Kubernetes, use mesh mode for heartbeats in the network stanza and a replication factor of 2. We have 2-node and 5-node clusters, and the same behavior is exhibited in each case. I can upload the exact configs if needed, but they truly are very “standard” from what I’ve seen.

I have scoured the web and found hardly any similar issues; the closest one is Total Number of records from aql & fetched through query do not match. However, I don’t really understand what the resolution was for the OP; it appears that they just started seeing matching record counts across the various ways of querying. Also, some comments there say that the number of records under n_objects is the number of master AND replica records, meaning that with 3 mil records and a replication factor of 2, n_objects would report 6 mil… however, that’s not what I’ve observed. After loading, for example, 654 records into a particular set in a 2-node cluster with replication factor 2, show sets reports

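+------------------+-----------+----------------+---------+-------------------+-----------+-------------------+--------------+------------+
| disable-eviction | ns        | set-enable-xdr | objects | stop-writes-count | set       | memory_data_bytes | truncate_lut | tombstones |
+------------------+-----------+----------------+---------+-------------------+-----------+-------------------+--------------+------------+
| "false"          | "default" | "use-default"  | "654"   | "0"               | <setName> | "331593"          | "0"          | "0"        |
+------------------+-----------+----------------+---------+-------------------+-----------+-------------------+--------------+------------+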

for both nodes, and our AMC dashboard reports twice that number (1308) under the Objects column for that set, but the client query returns 317. Another example, from our 5-node cluster (replication factor 2): the objects column reports roughly the same number of objects on each node (17, 18, 20, 17, 18), and the AMC dashboard reports 90, which is the sum of those counts, but the above query (or scanAll() directly) returns 45…
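To make the arithmetic behind these numbers explicit, here is a minimal sketch. It assumes, as the comments in the linked thread suggest, that each node's objects count includes replica copies as well as master copies; I have not confirmed that. The class and method names are made up for illustration and are not part of the Aerospike client.

```java
import java.util.Arrays;

// Sketch only: assumes per-node "objects" counts include master AND replica copies.
final class SetCountEstimate {

    // Estimates unique records as (sum of per-node object counts) / replication factor.
    static long estimateUniqueRecords(long[] objectsPerNode, int replicationFactor) {
        if (replicationFactor < 1) {
            throw new IllegalArgumentException("replicationFactor must be >= 1");
        }
        long total = Arrays.stream(objectsPerNode).sum();
        return total / replicationFactor;
    }

    public static void main(String[] args) {
        // 2-node cluster: show sets reports 654 objects on each node.
        System.out.println(estimateUniqueRecords(new long[] {654, 654}, 2)); // prints 654
        // 5-node cluster: per-node counts quoted above.
        System.out.println(estimateUniqueRecords(new long[] {17, 18, 20, 17, 18}, 2)); // prints 45
    }
}
```

Under that assumption, the 2-node figures work out to 654 unique records (the query returned 317) and the 5-node figures work out to 45 unique records.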

So anyways, any help shedding some light on this issue would be very much appreciated!

Update: the day after reporting, the client is now returning 337 records (previously 317). The sum of those two counts equals the actual total record count in the set, 654. So it looks like only one node’s worth of data is being returned at a time.

If a node in the cluster is not reachable by the client, that node can’t be added to the client’s view of the cluster. If that’s the case, the query will only query reachable nodes.
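A quick way to test reachability from the client machine is a plain TCP probe against each node's service address. The sketch below uses only the JDK, not the Aerospike client; the class name and the host addresses are hypothetical placeholders, and port 3000 is assumed to be the service port (the server default).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch only: a generic TCP probe, not part of the Aerospike client API.
final class NodeReachability {

    // Returns true if a TCP connection to host:port opens within timeoutMillis.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host: treat as unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical addresses: replace with the addresses your servers advertise.
        String[] hosts = {"10.0.0.11", "10.0.0.12"};
        for (String host : hosts) {
            System.out.println(host + ":3000 reachable=" + isReachable(host, 3000, 1000));
        }
    }
}
```

Run it from the same host (or pod) as the client application. If any node's advertised address is not reachable from there, that would be consistent with the query only covering part of the cluster.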

Subscribe to the client log, and cluster tend errors (such as an unreachable peer node) will be logged. See Logging | Developer.

This is a common error on AWS, where the server (by default) returns internal IP addresses for peer nodes and the client is external to AWS. It can be solved by setting access-address to an external IP address in the server config. See General Network Configuration | Aerospike Documentation.


Thanks for sharing this information.
