I went through the online documentation on the official Aerospike website. One thing I could not figure out is how the primary index is organized to support Set iteration.
As the documentation mentions, a Namespace (NS) is created with 4096 partitions, and each Record is assigned to one of these partitions based on the hash value of its key. Given that, a Set looks more like a logical concept than a physical one, so the records of a Set are distributed across all 4096 partitions of the NS.
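To make the partitioning concrete, here is a minimal sketch of how a key could be mapped to one of the 4096 partitions. It is not the server's code: SHA-1 stands in for whatever digest function Aerospike really uses, and the class name, method names, and the choice of which digest bits to use are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: SHA-1 is a stand-in digest, and the bit selection below is
// an illustrative choice, not the server's actual implementation.
final class PartitionSketch {
    static final int PARTITIONS = 4096; // partitions per namespace

    // Hash the set name together with the user key into a 20-byte digest.
    static byte[] digest(String setName, String userKey) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(setName.getBytes(StandardCharsets.UTF_8));
            md.update(userKey.getBytes(StandardCharsets.UTF_8));
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Reduce the first two digest bytes to a partition id in [0, 4096).
    static int partitionId(byte[] digest) {
        int low16 = (digest[0] & 0xFF) | ((digest[1] & 0xFF) << 8);
        return low16 % PARTITIONS;
    }

    public static void main(String[] args) {
        // Keys of the same set end up in many different partitions.
        for (int i = 0; i < 5; i++) {
            String key = "user" + i;
            int pid = partitionId(digest("less", key));
            System.out.println(key + " -> partition " + pid);
        }
    }
}
```

The point of the sketch is that the partition depends only on the digest, so two records of the same Set have no reason to land near each other.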
The documentation also says that Aerospike's primary index is an RB-tree. The key is surely the digest, and the value is the index metadata (void time, write generation, and storage address). Sets are not mentioned there.
The documentation also says: "Entire keyspace in a set (table) is partitioned using a robust hash function into partitions." My guess is one of the following:
For each partition, each Set of an NS has an individual RB-tree as its index. That means 4096 * NumberOfSet RB-trees form the whole primary index of an NS.
For each server, each Set of an NS has an individual RB-tree as its index. In that case, the primary index of an NS would contain NumberOfClusterNode * NumberOfSet RB-trees.
If it works this way, how can Aerospike scan a set efficiently?
Suppose an NS has 1 billion records and two Sets, and one Set named “less” has only 1 million records. Given the primary index structure, how can Aerospike iterate over only the 1 million records I need, without filtering through all 1 billion records of the NS?
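import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

// "client" is an already connected AerospikeClient.
Statement stmt = new Statement();
stmt.setNamespace("persistusers30d");
stmt.setSetName("userstempset");

long start = System.currentTimeMillis();
// No filter is set, so this query returns every record of the set.
RecordSet rs = client.query(null, stmt);
try {
    while (rs.next()) {
        Key key = rs.getKey();
        System.out.println(key.toString() + "\t" + key.userKey);
        Record record = rs.getRecord();
        System.out.println(record.getValue("username"));
        System.out.println(record.getValue("interests"));
    }
} finally {
    rs.close();
}
// Elapsed time in milliseconds.
System.out.println(System.currentTimeMillis() - start);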
It takes about 8 s to finish. The set userstempset contains only one record, while the NS persistusers30d contains 27,008,609 records. As far as I can tell from the primary index structure, the set is not part of it and the set name is not used for filtering. I would really like to know how Aerospike is able to do this so fast. Where is the Set name from the statement used for filtering?
Even though the set name is hashed together with the key, it is still sent separately in the wire protocol and remembered separately as part of the record. This is why filtering by set name is still possible.
One more thing to confirm: does Aerospike iterate over the whole primary index and filter by set name, or is there some search strategy that lets it skip entries (since the primary index is an RB-tree)?
It iterates over the whole primary index to find the records that match the set.
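A rough sketch of what that full iteration looks like, assuming the set is kept as a small id next to the other index metadata. Java's TreeMap (itself a red-black tree) stands in for one partition's index, a long stands in for the digest, and every class and field name here is hypothetical rather than taken from the Aerospike source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: a TreeMap plays the role of one partition's primary index.
final class SetScanSketch {
    // Hypothetical index metadata; the set is stored as a small id.
    static final class IndexEntry {
        final int setId;
        final long voidTime;    // expiration time
        final int generation;   // write generation
        final long storageAddr; // location of the record on storage

        IndexEntry(int setId, long voidTime, int generation, long storageAddr) {
            this.setId = setId;
            this.voidTime = voidTime;
            this.generation = generation;
            this.storageAddr = storageAddr;
        }
    }

    // The tree is ordered by digest, not by set, so every entry is visited
    // and the set id is checked on each one.
    static List<Long> scanSet(TreeMap<Long, IndexEntry> index, int wantedSetId) {
        List<Long> matches = new ArrayList<>();
        for (Map.Entry<Long, IndexEntry> e : index.entrySet()) {
            if (e.getValue().setId == wantedSetId) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeMap<Long, IndexEntry> index = new TreeMap<>();
        // 1000 records in set 1 and a single record in set 2.
        for (long d = 0; d < 1000; d++) {
            index.put(d * 7919L, new IndexEntry(1, 0L, 1, d * 128L));
        }
        index.put(4242L, new IndexEntry(2, 0L, 1, 999_999L));
        // All 1001 entries are visited to find the one match.
        System.out.println(scanSet(index, 2)); // prints [4242]
    }
}
```

Because the tree is sorted by digest, its ordering cannot be used to jump to the entries of one set. The scan only touches in-memory index entries for the non-matching records, which is consistent with a one-record set coming back in seconds out of 27 million.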
As a side note, Aerospike iterates through the whole primary index periodically to expire data as well.
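The expiration pass can be pictured the same way. This is only a sketch under assumptions: a TreeMap from a stand-in long digest to the void time plays the role of the index, a void time of 0 is treated as "never expires", and the names are made up for illustration.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: walks the whole index and drops entries past their void time.
final class ExpirySweepSketch {
    // Returns how many entries were removed. A void time of 0 is assumed
    // to mean the record never expires.
    static int sweep(TreeMap<Long, Long> voidTimeByDigest, long now) {
        int removed = 0;
        Iterator<Map.Entry<Long, Long>> it = voidTimeByDigest.entrySet().iterator();
        while (it.hasNext()) {
            long voidTime = it.next().getValue();
            if (voidTime != 0 && voidTime <= now) {
                it.remove(); // safe removal while iterating
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        TreeMap<Long, Long> index = new TreeMap<>();
        index.put(1L, 0L);   // never expires
        index.put(2L, 50L);  // already expired at now = 100
        index.put(3L, 150L); // still alive at now = 100
        System.out.println(sweep(index, 100L) + " expired, " + index.size() + " left");
    }
}
```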