How is Aerospike primary key organized to iterate Set effectively?


I went though the online documents of Aerospike official website. One thing I didn’t figure out is how the primary key organized to support Set iteration?

As the document mentioned, a Namespace(NS) is created with 4096 partitions and a Record will be assigned to one of these partitions based on the hash value of Record key. Based on these, Set is like a logical concept than a physical. So records of a Set will be distributed to 4096 partitions of NS.

And the document says aerospike’s primary index is a RB-Tree. The key is digit surely and the value is index metadata(void time, write gen and storage addr). And Set is not mentioned here.

The document also says Entire keyspace in a set (table) is partitioned using a robust hash function into partitions. My guess is

  1. for each partition, each Set of a NS has a individual RB-Tree as index. That means there are 4096 * NumberOfSet RB-Trees to form the whole primary index of a NS.

  2. For each server, each Set of a NS has a individual RB-Tree as index. And in this way, the primary key of an NS should contains NumberOfClusterNode * NumberOfSet RB-Trees.

Which one is true or neither is?


Setname is simply part of the key. It is (prepended) as part of the key, before the key is hashed using RIPEMD160.

There are 4096 RB trees per namespace. Nothing separate for each set.

Will be modifying the doc to make it clearer.



If in this way, how could aerospike scan a set effectively?

Suppose there is a NS has 1b records and there are two Sets. One set named “less” has only 1m records. To scan the Set, based on the primary index structure, how can aerospike only iterate the 1m records I needed but avoid to filter the full 1b records of the NS?


I did a test on set iteration using

    Statement stmt = new Statement();
    long start = System.currentTimeMillis();
    RecordSet rs = client.query(null, stmt);
    try {
        while ( {
            Key key = rs.getKey();
            System.out.println(key.toString() + "\t" + key.userKey);
            Record record = rs.getRecord();
    } finally {
    System.out.println(System.currentTimeMillis() - start);

It takes about 8s to finish. The set userstempset contains one record only. NS persistusers30d contains 27,008,609 records. Based on the primary index structure, there is no set included and the set name is not used to do filter. Really wanna know how aerospike be able to do this so fast. Where the Set name in the statement is used to do the filtering?


Even though sets do not have their own partition trees, the set information is still saved as an integer in the record’s index structure.

When doing a scan to search for all data in a set, the index tree is iterated and only those matching the set will be returned.

An alternative is to create secondary index on the interested set, on an integer bin. Then a full range query will return all data in the set.


The 20 byte digit is hash (set + record key) or hash(set name) + hash (record key)?

If it is hash(set name) + hash (record key), I believe scan the index tree is very fast based on hash(set name).

Is there any document I can refer to?



It is hash (set+record key)

Please see computDigest() in Aerospike’s Java Library


In this way, how to do filtering in primary index tree just using a Set Name?


Even though the setname is being hashed together with the key, the information is still separately sent on the wire protocol, and remembered separately as part of the record. This is why filtering based on setname is still possible.


So the cost to iterate a set is Went though Primary Index with Set Name as filter + read expected record from storage, right?


yes, this is correct!


One more thing to make sure. Does aerospike iterate the whole primary index to filter set name or there is some search strategy to skip (Since this primary index is a RB tree…)?


It is iterating the whole primary index to get the records matching the set As a side note, Aerospike iterates through the whole primary index periodically to expire data as well.


many thanks for your detail explanation.