How to manage high selectivity secondary indexes effectively within Aerospike


#1

FAQ How to manage high selectivity secondary indexes effectively within Aerospike

Background

Secondary indexes use a scatter/gather design to retrieve data meaning that queries will run on every node in the cluster. When the secondary index is highly selective (or returns very few records), this results in unecessary work, as the query runs on all nodes but the requested record will sit on one or a few nodes only.

Suggested Approach

It is suggested to create a reverse mapping between the secondary key and the primary key(s) within the same namespace such that rather than querying with the scatter/gather approach, the reverse mapping is used to do a simple key/value or a small batch transaction. This is illustrated as follows:

Main data set containing records

Records looks as follows A, B, C (primary keys) with comma separated bins and secondary index on the first bin:

A–> b,c,d,f B–> f,g,k,l C–> b,k,f,l

Lookup set containing a mapping between the unique secondary index bin and the primary keys

b–>A,C f–>B

The index query is then broken into two steps.

Step 1:

The secondary index query is sent to lookup set, which executes as a key/value store query and returns A and C, the primary keys for the main data set.

Step 2:

The primary keys are sent to the main data set which executes a subsequent batch transaction and returns the required records from the nodes where they reside rather than all nodes.

Points to consider

  • The application has to be coded to issue two index calls, this is not handled implicitly.
  • There is no link between the lookup set and the main data set within the database therefore the application is responsible for maintaining consistency