Primary key iterator and seek functions sorely missed

secondary
index

#1

I’ve been doing detailed prototyping on Aerospike. Our product currently uses an open source Java variant of LevelDB, which works well for small to medium sized data sets, but will not scale to anything approaching ‘big data’. I was very impressed by Aerospike’s ease of installation and configuration, as well as its performance and clear scalability. It also offers the basics I was looking for - put and get key value pair operations where the key can be a string, integer or byte array.

However one fundamental that is badly lacking, in my opinion, is the area of primary key iteration. We need (and I anticipate that this is a universal requirement, and these functions are ubiquitous in competitor products) the ability to iterate a set using a primary key. Furthermore we need a ‘seek’ function which places an iterator at the first primary key value equal to or higher than the seek target value. This needs to work for String keys, not just for integer keys.

These seem to be to be functions that are universally available in other non SQL database products. I have spent a lot of time googling as to how to fulfil these requirements in other ways using Aerospike. Obviously record sets and queries are the way to go, and I will be able to make a go of it using various workarounds, but even there I am hampered by the fact that ranges cannot be set for String keys. I also have to mess about creating secondary indices representing the primary keys which just seems plain silly and expensive overhead wise.

I cannot understand why these to my mind fundamental capabilities are not provided. I have seen references to the primary index being organised using BTrees etc. so positioning an iterator at a node in a tree and then traversing the rest of the nodes in that tree in alphabetical or numerical order depending upon the key type does not sound like a Labour of Hercules to me.

It’s a real shame, and will probably lead to us not being able to use Aerospike.


#2

See if you can attend one of the upcoming free Aerospike architecture webinars online. It will help understand the why.

Aerospike does not store your key. It stores the hash of your set name+your key. The b-trees are per partition per namespace and are traversed to find the primary index of the individual record you are seeking. The primary index stores this hash value besides pointers to the data location. 12bits in this hash value yield the partition ID of this record–> There are 4K partitions per namespace and the partition are pretty evenly distributed across all nodes in your cluster. Range reads/writes based on your primary key have to be orchestrated at the client level and will be processed one record at a time. Aerospike has a record level lock.


#3

I am not so sure that explaining the internals is useful, especially in the context of explaining why (what I consider to be) fundamental capabilities are missing.

Do you agree that a primary key iterator combined with a seek function that works for both integer and key strings would be useful?

If you do agree, then I would like to know how best to achieve those aims with Aerospike, or an indication of when these capabilities might be available.

If not (i.e. you do not consider these functions to be useful) then I would be interested in your reasoning as that might help me to understand other ways of achieving the same ends.


#4

Iterating keys isn’t as ubiquitous of a requirement, or Aerospike wouldn’t have customers. In a database that was designed to run on a single node you definitely can come up with such a mechanism. However, Aerospike has always been a distributed database (which is why it has superior clustering to the single-node variants). Even if your keys have some chronological order to them, they’ll end up evenly spread across the nodes, which is a feature of a good distributed database. Even distribution of data is crucial for ensuring even capacity utilization and workload distribution on the nodes.

You’re assigning a value meaning to the key, when it’s supposed to be an identifier for the record. On each of the nodes, the primary index is actually organized as a series of red-black trees (sprigs). The way the keys appear in those red-black trees has to do with order of insertion, not the value of the key. Plus, the key that you use in your application is hashed along with the set through RIPEMD160, producing a 20B digest that is the actual unique identifier of the row. As Piyush mentioned, understanding the architecture isn’t something to wave away when it comes to why certain things work really well (speed at scale) and why some don’t (ordering).

Currently you can leverage predicate filtering to do range matching on integer values and regular expressions on strings, if you store your key as a bin. A modulo operator can be applied to the digest of the key. It’s likely that more functionality will be added to predicate filters - for example there’s a feature request for partial key matching. Predicate filtering does not require that you have secondary indexes, you can also apply them to a scan.

If you’re using the Java client take a look at the PredExp class and its examples. This is also supported by the C, C# and Go clients.


#5

The application logic issue boils down to needing the capability to use predicates (I have already used regex), in such a way that I can specify a value with operators for a string that has been set up as a secondary index along the lines of >= “a”. My main issue at the moment is the fact that the only String filter is ‘equal’, and that regex does not have anything like >= . Possibly this is where partial key matching would come in?


#6

As I mentioned, extending predicate filtering to cover range matching for strings is what you may need. It’s a legitimate, existing feature request, but you can add your voice by asking for it on the aerospike/aerospike-client-java side, or as an issue in the aerospike/aerospike-server repo.

Until then, this can be handled as a stream UDF. There are many similar questions on stackoverflow.