Fetch Secondary Index results from a Single Node using replication factor 128

Server Version: 6.3

Two namespaces: one that will have millions of records and will be queried by PK only, and another that will have < 100 records and will be queried using secondary indexes. This question is regarding the second namespace.

In a cluster of n nodes with m record keys, each node is master of ~ m/n keys. A secondary index query goes to all nodes, each node returns data only for its master keys, and aggregation happens on the client side. Now, if replication factor = n (128, the number of nodes), then each node holds all the data for that namespace.

Is there a way to avoid calls to all nodes in the cluster and instead fetch all data from any one node (master + replica keys) for a secondary index query?


Interesting question - even if we made each node its own rack-id and tried to do a preferred rack read, I don’t think SI query honors that - even though internally the nodes have SI b-trees for both master and replica partitions. To the best of my knowledge, this is not supported. Take this as a placeholder response. Someone else may chime in as well. (Update: As I was corrected later, this can be done and the rest of the thread explores this with conclusive test data at the end.)

Thanks @pgupta for the prompt response.

An alternate approach could be to reverse-index the SI bin using Aerospike Lists/Maps. This requires maintaining multiple copies of the same data and keeping them in sync. The solution gets more complicated when multiple bins in the same set have a secondary index.

Could this be taken as a feature request? That way, SI query latency would be independent of cluster size.


I would like to understand why you want the SI query to go to only one node? (You have server version 6.3). If you have 128 nodes, each node has ~ 4096/128 master partitions ownership. The client runs SI query spreading it by partitions, knows which node is master of which partition, distributing the work across all nodes. Why would you want all the work to be handled by a single node? That will make the query performance worse by removing the search parallelism. (Whether 4K partitions’ SI b-trees are being searched on one node vs 128 in parallel - the work is the same.)
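As a quick sanity check on the arithmetic above (a minimal sketch; the only assumption is Aerospike's fixed partition count of 4096):

```java
// Approximate master-partition ownership per node in an Aerospike cluster.
class PartitionMath {
    // Aerospike namespaces are always split into 4096 partitions.
    static final int PARTITIONS = 4096;

    // Approximate number of master partitions each node owns.
    static int masterPartitionsPerNode(int nodes) {
        return PARTITIONS / nodes;
    }

    public static void main(String[] args) {
        // 128-node cluster: each node masters ~32 partitions.
        System.out.println(masterPartitionsPerNode(128));
    }
}
```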

The argument for Reverse Index is that you only have 100 records that will be found in 4K partitions search - i.e. many partitions will not find a match - so wasted effort. In that case, building your own reverse index can help. As you noted, it really becomes a multi-record transaction to maintain consistency. (I don’t think it will be an easy feature request to implement at server level.)

I have some content that may help from a data-model perspective - it's from a while back but nevertheless helpful. (Note: single-bin and data-in-index are now deprecated - ignore the sizing-calculations part.)

Update Sequence for avoiding dangling pointers:

  • 1 - Update the Lookup Table entry for the new or updated userIDAttr
  • 2 - Update the userID – data / userIDAttr list record
  • 3 - Update the Lookup Table to delete the stale userIDAttr entry (if applicable, i.e. when userIDAttr was modified)
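The sequence above can be sketched with a minimal in-memory model (plain Java maps stand in for the Lookup Table and the userID records; names and structure are illustrative, not the actual data model - in Aerospike each side would be its own record, e.g. a map/list bin per attribute value):

```java
import java.util.*;

// Minimal in-memory model of the reverse-index bookkeeping.
// Illustrative only: in Aerospike, "index" would be one record per
// attribute value and "current" would be the user record itself.
class ReverseIndex {
    // attribute value -> set of userIDs currently holding that value
    private final Map<Long, Set<String>> index = new HashMap<>();
    // userID -> its current attribute value
    private final Map<String, Long> current = new HashMap<>();

    void upsert(String userId, long newValue) {
        // 1 - add the Lookup Table entry for the new/updated value first
        index.computeIfAbsent(newValue, v -> new HashSet<>()).add(userId);
        // 2 - update the userID record itself
        Long old = current.put(userId, newValue);
        // 3 - delete the stale Lookup Table entry, if the value changed
        if (old != null && old != newValue) {
            index.get(old).remove(userId);
        }
    }

    Set<String> query(long value) {
        return index.getOrDefault(value, Set.of());
    }
}
```

Ordering step 1 before step 3 means a concurrent reader might briefly find a userID under both the old and the new value, but never under neither - which is the dangling-pointer case the sequence avoids.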

Thanks @pgupta for sharing the screenshots.

The feature request is for the Aerospike Client: if the replication factor is 128, there could be an option (maybe in QueryPolicy) to fetch data from one node instead of distributing the query across all nodes (assuming the latency difference between the two approaches for smaller namespaces is insignificant).

The feature request is not for building a reverse index at the server level. Building the reverse index on our own is the alternate approach, and I was also thinking along the same lines as the approach shared in the screenshots.

I would like to understand why you want the SI query to go to only one node?

In the context of the smaller namespace with < 100 records, I am trying to optimise away the client connections to all nodes, provided the latency difference for an SI query on 1 node (as proposed) vs all nodes (as today) is insignificant. Although in the current implementation the work happens in parallel, I think latency might increase as the number of nodes grows. Will get some data around this via a load performance test.


For the cluster you are running, I am guessing you have an Enterprise Edition license. If I were to take a guess, you are probably serving some kind of catalogue on Aerospike - small enough to fit on a single node, infrequently updated, with very low latency read requirements and an associated SI query - to a bank of application servers. And your driving concern is the number of file handles being opened, one per socket. Currently, one SI query uses one socket to each node. You want to do everything on a single node, to reduce the number of sockets opened by the bank of application servers, and you don't mind the extra latency of running the SI query off a single node.

For this kind of feature request, you might want to start with a Support ticket or reach out to your Sales Engineer associated with your account. There may be other customers who may have asked for something similar. In the meantime, I will do my part to bring attention to your request internally.

Update: I have been corrected internally. Some other customer did request this feature and it was added last year. Hence, I am told, it is possible to do this - I am working on testing it, I will get back shortly.

==> The feature request is for the Aerospike Client: if the replication factor is 128, there could be an option (maybe in QueryPolicy) to fetch data from one node instead of distributing the query across all nodes (assuming the latency difference between the two approaches for smaller namespaces is insignificant).

Here is how to do it. I set up a 3-node cluster. Each node has a namespace test. I use replication factor 3, so each node has a full copy of the data.

Additionally, for namespace test, each node has its own rack-id: node A1 uses rack-id 1, node A2 uses rack-id 2, and node A3 uses rack-id 3.

rack-id is a dynamic parameter, so you can set it on an existing cluster and then update the configuration file for future restarts of the node.
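As a sketch of that (assuming the standard asinfo tool and the namespace name test from this thread; verify the exact syntax against your server/tools version), setting it dynamically might look like:

```shell
# Set rack-id dynamically on one node for namespace test
asinfo -v "set-config:context=namespace;id=test;rack-id=1"

# Trigger a rebalance so the new rack assignment takes effect
asinfo -v "recluster:"

# Also add "rack-id 1" under the namespace stanza in aerospike.conf
# so the setting survives a restart.
```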

Here is my namespace configuration for node id A1:

namespace test {

  replication-factor 3
  rack-id 1

  default-ttl 5d # 5 days, use 0 to never expire/evict.
  nsup-period 15
  storage-engine device {
    file /opt/aerospike/data/test.dat
    filesize 5G
  }
}
I have added 10 test records as shown below:

Initialized the client and connected to the cluster.
key0 : (gen:1),(exp:453789840),(bins:(name:Sandra),(age:34))
key1 : (gen:1),(exp:453789840),(bins:(name:Jack),(age:26))
key2 : (gen:1),(exp:453789840),(bins:(name:Jill),(age:20))
key3 : (gen:1),(exp:453789840),(bins:(name:James),(age:38))
key4 : (gen:1),(exp:453789840),(bins:(name:Jim),(age:46))
key5 : (gen:1),(exp:453789840),(bins:(name:Julia),(age:62))
key6 : (gen:1),(exp:453789840),(bins:(name:Sally),(age:32))
key7 : (gen:1),(exp:453789840),(bins:(name:Sean),(age:24))
key8 : (gen:1),(exp:453789840),(bins:(name:Sam),(age:12))
key9 : (gen:1),(exp:453789840),(bins:(name:Susan),(age:42))

Quickly check data distribution using asadm -e info command:

I created a secondary index on the age bin as follows:

asadm --enable -e "manage sindex create numeric idx_age ns test set testset bin age"

For running the query, here is my client code.

//Instantiate client object with Preferred Rack ClientPolicy
//Here, this client is indicating its preferred rack is the one with rack-id=1.
ClientPolicy cp = new ClientPolicy();
cp.rackId = 1;   //Next, changed to 2 and then 3, for testing.
cp.rackAware = true;
AerospikeClient client = new AerospikeClient(cp, "localhost", 3000);

//Run SI query
Statement stmt = new Statement();
stmt.setNamespace("test");
stmt.setSetName("testset");
stmt.setFilter(Filter.range("age", 20, 30));
QueryPolicy qp = new QueryPolicy();

//Specify query to use preferred rack
qp.replica = Replica.PREFER_RACK;

RecordSet rs = client.query(qp, stmt);
while (rs.next()) {
    Record r = rs.getRecord();
    Key thisKey = rs.getKey();
    System.out.println(thisKey.userKey + " : " + r);
}
rs.close();

//Close this client
client.close();
Code output (should be same for each test with different rack-id):


I am using Jupyter Notebook, so I can change the preferred rack Id and validate that I am still getting the correct results and the query is going to a single node by watching the log ticker output as below:

Only the ticker for the preferred rack should bump up in count of long-basic.

$ sudo cat /var/log/aerospike/aerospike.log |grep si-query

May 14 2024 04:59:03 GMT: INFO (info): (ticker.c:885) {test} si-query: short-basic (0,0,0) long-basic (5,0,0) aggr (0,0,0) udf-bg (0,0,0) ops-bg (0,0,0)

The long-basic (5,0,0) is the number that I am watching on each node. The query is only getting executed on the one node with preferred rack-id.


Thanks @pgupta. This is really helpful.

I have a few doubts as listed below:

  1. Which version have you tested this on? I am trying it with server version 6.3 and client version 4.4.18, and it's not working.
  2. Does it also work for QueryPolicy with Replica.MASTER_PROLES or Replica.RANDOM? If yes, then I believe the client doesn't need to be rack-aware.
  3. Is there an official blog on the performance difference between the two approaches?


I tested on latest - server E- (see asadm info screenshot), Java Client 8.1.1 / aerospike-client-jdk8


This feature was added almost a year ago. See https://download.aerospike.com/download/client/java/notes.html - Java client version 6.1.10 - JIRA: CLIENT-2073 - May 18, 2023.

2 - Replica.MASTER_PROLES, Replica.RANDOM - I don't think those will give you what you want. Again, I will have to check internally and get back.

3 - Sorry, no performance blog - I think you are the first person trying to do this with RF = ALL (all data on each node, go to a single node). I thought of PREFER_RACK as a way to achieve what you want.

Thanks. I will try with the mentioned client version. I was using the client version below.


Our use case is as you mentioned: one part is serving a catalogue on Aerospike, and the other is fetching records at a configured interval (< 5 mins) via a cron job to maintain an application cache required for a high-throughput user-facing API. Both require the use of secondary indexes. There will be > 50 application servers, and the Aerospike cluster will have at least 6 nodes. Both are expected to grow over time.

I was just curious whether, if the number of records is small and the latency difference is not that much, it might be an optimisation to fetch all the records from a single node and distribute reads across the cluster using Replica.RANDOM. The application servers are spread across 5-6 racks, possibly unequally distributed, whereas the Aerospike cluster is equally distributed across 3 racks. I will do a performance test and share the results here once done.


Please note the change in the artifactId in the POM. See: Install the Aerospike client | Developer
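For reference, a hedged sketch of what the POM dependency would look like with the renamed artifact (coordinates based on the version mentioned earlier in the thread; verify against the linked install page):

```xml
<dependency>
    <groupId>com.aerospike</groupId>
    <artifactId>aerospike-client-jdk8</artifactId>
    <version>8.1.1</version>
</dependency>
```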

Tried with the version below:


It works for Replica.PREFER_RACK, but I get the error below for Replica.RANDOM:

Exception in thread "main" com.aerospike.client.AerospikeException: Error 4: Invalid replica: RANDOM
	at com.aerospike.client.query.PartitionTracker.init(PartitionTracker.java:194)
	at com.aerospike.client.query.PartitionTracker.<init>(PartitionTracker.java:78)
	at com.aerospike.client.query.PartitionTracker.<init>(PartitionTracker.java:62)
	at com.aerospike.client.AerospikeClient.query(AerospikeClient.java:2787)

Is Replica.RANDOM not a valid parameter for client.query?


Most likely not - will research and get back later today.

Replica.RANDOM is not valid for query because the chosen node must contain the requested partition, and a query is not proxied, so a random node will not work in the general case. It so happens in your case, because of RF = all, that you won't hit a situation where the node does not have the partition - but for RF < number of nodes, Replica.RANDOM would not work, hence it is not supported.

Replica.MASTER_PROLES is supported for queries, but I don't see you getting much benefit out of it. What it means is: the client will send the query for the first partition to the node holding its master, the query for the next partition to the node holding replica 1, and so on up to replica N, then wrap around to the master and repeat the pattern for every remaining partition.
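A hedged model of that spreading pattern (illustrative only, not the client's actual internals; replica 0 denotes the master):

```java
// Model of the Replica.MASTER_PROLES spread described above:
// partition i is sent to replica (i % rf), wrapping around to the
// master. Illustrative only - not the Aerospike client's actual code.
class MasterProlesSpread {
    static int replicaForPartition(int partitionId, int replicationFactor) {
        return partitionId % replicationFactor;
    }

    public static void main(String[] args) {
        // With RF=3: partitions 0,1,2 go to master, replica 1, replica 2,
        // then partition 3 wraps back to the master, and so on.
        for (int p = 0; p < 6; p++) {
            System.out.println("partition " + p + " -> replica "
                + replicaForPartition(p, 3));
        }
    }
}
```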

Also, if a query to a partition fails due to connection errors, timeouts, or device overload errors, the query is retried as long as it has not hit totalTimeout or exceeded maxRetries. When retrying, the query goes to the node holding the next replica per the Replica.xxx setting. The default is Replica.SEQUENCE.


Got it. That makes sense. I will use Replica.PREFER_RACK for my performance test. Will share results.

Thanks @pgupta for the prompt and detailed responses. Highly appreciate it.


Performed a test comparing the performance of SI queries using QueryPolicy Replica.SEQUENCE vs Replica.PREFER_RACK on an Aerospike cluster of 3 nodes (instance type: t4g.medium) equally distributed across 3 racks (AZs), with the configuration below:

namespace test {
  rack-id 1
  replication-factor 128
  nsup-period 1h
  nsup-hist-period 1h
  memory-size 3G
  strong-consistency false
  high-water-disk-pct 80
  high-water-memory-pct 80
  stop-writes-pct 90
  partition-tree-sprigs 8192
  storage-engine device {
    device ...
    device ...
    read-page-cache true
    data-in-memory true
    write-block-size 128K
    max-write-cache 64M
  }
}

Performance Test Parameters

Total no. of records: 50 
bins: 2 (bin1, bin2)
Sindex: bin2
bin2 Cardinality: 2
Query rpm: 500
Total Time: 10 mins


p95 latency with Replica.SEQUENCE: 2.37ms
p95 latency with Replica.PREFER_RACK: 4.83ms

Note: The main goal of this was only to compare performance of Sindex w.r.t. different QueryPolicy.replica

Conclusion: The data shows that querying with Replica.PREFER_RACK has higher latency (almost 2x) than querying with Replica.SEQUENCE.



Thanks for sharing the test data, really appreciate it.
