Scans are often used within Aerospike. A scan is an operation to retrieve a potentially large number of records without specifying their primary keys or digests. The easiest way to think of a scan is the following select statement in AQL (where
test is a namespace name):
AQL> select * from test
When such a statement is executed, scan threads read every index entry within the primary index for that namespace. This can mean that scans are time-consuming, and can have a pronounced effect on system resources. A further issue is that, in many cases, a user may not want to scan a whole namespace and, in fact, may only want to retrieve records from a certain set. As a set is not a physical construct and is actually only an attribute of a record, this can lead to inefficiencies. Set membership is listed within the index entry for a record in the primary index. This means, to scan all records in a given set every single index entry for the namespace has to be read. This is called ‘reducing’ the index. When there are sets which are small in comparison to the overall size of the namespace this leads to a lot of resource and time spent to retrieve a proportionally small number of records. Set Indexes have been introduced in Aerospike 5.6 to address these inefficiencies. This article discusses some of the basic information around them.
A Set Index can be considered a shadow of the Primary Index. Like the Primary Index they exist on all nodes within the cluster. The Primary Index is composed of a number of RB Tree structures for each partition, called sprigs. The minimum amount of sprigs per partition is 256 (configurable through
partition-tree-sprigs) which implies a minimum of around 1 million sprigs per namespace. Each element in the RB Tree or sprig is a 64-byte index entry which consists of the 20-byte digest and various other pieces of index metadata (such as, if applicable, record location on disk, generation, void time, etc). The digest is a hash of the key and the set name if the record has one. The primary index also has 256 locks available per partition, which, again, leads to around 1 million locks per namespace. Unlike the number of sprigs per partition, this is not configurable. A set index has the same RB Tree or sprig structure and a fixed number of 256 sprigs per index per partition. Locks are also fixed at 256 per partition. A Set Index, therefore, is a subset of the Primary Index containing only records from a given set and nothing else. Like the Primary Index, Set Indexes are built-in RAM. Unlike the Primary Index, Set Indexes do not exist in shared memory and so are rebuilt on startup. Storing Set Indexes in shared memory is not necessary as they are extremely fast to rebuild. The Set Index contains a reference back to the entry for that record in the Primary Index as well as any child elements. It also contains a small fragment of the record digest that helps optimize tree manipulations.
When a scan runs against a set that has an index enabled, it reduces the Set Index and, for each element, follows the reference to the Primary Index to retrieve the full index entry and locate the record.
Set Indexes are useful when an application is scanning a set that is small in comparison to the namespace it is contained within. The scan will automatically use the Set Index when it is present. Given that the Set Index contains only records that belong to the set, there is no loss of efficiency in scanning records that are not required by the application. This allows for a far higher volume of scans to be run with the same set of resources.
Like the Primary Index, set indexes are sorted by digest. Records within the set are densely packed within the Set Index and so scanning is fast. The entry in a Set Index is 16 bytes and contains a 5-byte reference to the Primary Index. Although a Set Index entry only contains a fragment of the record digest, the fragment is sufficient to maintain effective ordering.
The entry into the Set Index is made slightly after the entry into the Primary Index to ensure that the record has been inserted. As the digest fragment within a Set Index entry is sufficient to preserve order, it is not necessary to refer back to the Primary Index to determine where a given entry should go within the index when inserting into or manipulating the Set Index. This behavior is the same on both master and replica nodes for a given record.
Yes, Set Indexes can be switched on and off dynamically using the
enable-index command. They can be switched on or off mid-scan. As the ordering is the same as the Primary Index, if they are switched on or off the scan can switch from one to the other easily as the progress of the scan will be the same.
No special client-side configurations have to be made. If a Set Index exists for a given set, the server will scan that Set Index rather than the Primary Index. If there is no Set Index, the Primary Index is used as normal.
Set Indexes do not contain tombstones at the time of writing this article (version 5.6). Tombstones are used within Aerospike to create a marker for a deleted record so that the record does not return to the cluster upon cold restart. Set Indexes are not used to repopulate the Primary Index on startup and so tombstones are not required. This does mean that certain operations which reduce the Primary Index, like
truncates will not use Set Indexes as
truncates would remove tombstones in addition to live records. When a delete is issued, a tombstone is created in the Primary Index but the record is simply removed from the Set Index.
Yes, Set Indexes can be used with an all-flash namespace. It is perhaps even more valuable to use a Set Index with all-flash as reducing the primary index would imply significant disk I/O whereas the Set Index exists in RAM. There are certain caveats that should be considered which are discussed in a subsequent section.
Like the Primary Index, Set Indexes allocate memory in arenas. Unlike the Primary Index, which uses an arena stage size of 1 GiB, Set Indexes use an arena stage size of 4 KiB. On a per-element basis the Set Index is about 1/4 the size of the Primary Index. Each partition will allocate the initial 4 KiB stage on index creation and, given the 4096 partitions in total, this implies an initial allocation (for the index across the whole cluster) of 16 MiB and allows indexing of around 1 M records (as each record would use 16 bytes for the Set Index). The advantage to the 4 KiB allocation is that memory growth as the index fills up is far more granular and easier to manage. The memory used by Set Indexes within the namespace is detailed in the standard memory usage log line that looks as follows:
This is also reflected in the stat
No. A Set Index will never reduce performance however there are certain situations where the Set Index will not make a huge performance impact and could be considered to be a waste of resources. The key use case is for a set that is small in comparison to the overall number of records in the namespace, however, this is not a panacea. The assumption that the Set Index will be effective in that scenario presumes that the chief bottleneck is in reducing the Primary Index in which set membership may be sparse. If the bottleneck is not in the index but instead in reading large records back off a storage device, the optimization may make little difference and therefore the memory spent on the index may be considered wasted. The same can be said to be true if the main time spent in the scan is streaming records back to the client over the network.
When the index is stored on disk (all-flash) the density of records belonging to the set within the sprigs is also a factor. When reading the Set Index the scan will refer back to the Primary Index to find set members it locates within the Set Index. This means the sprig containing those index entries has to be recalled from disk. To recall a sprig from disk, the entire sprig has to be read back which is (if the sprigs per partition are correctly sized) a single unit of I/O. If the density of set members per sprig is higher, there may be minimal advantage to using a Set Index as the I/O in recalling a sprig has to be done, whether or not a Set Index is in use. If the members per sprig density is much lower than the ‘cost’ of RAM may well be worth paying as it would speed up scans significantly.
In all of these circumstances, testing is easy due to the dynamic nature of Set Indexes.
- Aerospike Primary Index structure.
- Details on Aerospike tombstones and Durable Deletes.
- The number of records in a given set can be returned using the
SET INDEX NAMESPACE SCAN