Partial Scans in Aerospike

Aerospike_Knowledge · August 3, 2020, 5:49am

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Partial Scans

For sampling purposes, users often need to scan only a subset of the records in an Aerospike cluster.

Using maxRecords (applicable for server version 4.9 and above)

It is recommended to use the maxRecords API rather than scanPercent when using server versions 4.9 and above. Compared with scanPercent, the main advantage of maxRecords is that its selection is entirely random and the number of records it returns is more accurate.

How do I replace scanPercent in my code with the new maxRecords API when I upgrade the server from a version prior to 4.9?

To get a random percentage of the records using the maxRecords API, issue an info call from the client to obtain the total number of records in the cluster, then set maxRecords to the desired fraction of that total.
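The conversion can be sketched as follows. This is a minimal sketch, not official client code: it assumes the client has already sent a `namespace/<ns>` info request to each node and collected the raw responses as strings. The helper sums the `master_objects` statistic and converts a percentage into a record count. The class and method names are hypothetical, and the choice of `master_objects` (rather than `objects`, which also counts replica copies) is an assumption that should be checked against your server version.

```java
import java.util.List;

// Hypothetical helper, not part of the Aerospike client library.
final class SampleSize {

    // Extracts a numeric statistic from a semicolon-separated info response
    // such as "objects=2000;master_objects=1000". Returns 0 if it is absent.
    static long parseStat(String infoResponse, String statName) {
        for (String pair : infoResponse.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).trim().equals(statName)) {
                return Long.parseLong(pair.substring(eq + 1).trim());
            }
        }
        return 0L;
    }

    // Sums master_objects across all node responses and returns the record
    // count that corresponds to the requested percentage (rounded down).
    static long maxRecordsForPercent(List<String> nodeResponses, int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be in [0, 100]");
        }
        long total = 0L;
        for (String response : nodeResponses) {
            total += parseStat(response, "master_objects");
        }
        return total * percent / 100;
    }
}
```

The result would then be assigned to the scan policy in place of the old percentage, for example `policy.maxRecords = SampleSize.maxRecordsForPercent(responses, 10);`.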

If I define a lastUpdate predicate for my scan with maxRecords = 5000, is the predicate applied to only 5000 records, or the other way around? In other words, is the predicate evaluated against just 5000 records, or does the final output contain maxRecords records, all of which evaluate to true?

Aerospike applies the predicate expression filter to the records it processes, and the final output contains maxRecords records, all of which satisfy the filter. Note that this is a change in behavior from scanPercent (described below).

Using scanPercent (applicable for server versions prior to 4.9)

What is the order of records when scanning at scanPercent < 100? Will it be random?

The scan sub-system scans a number of records in each partition based on the scanPercent value. For example, if 10 is specified in Java:

                policy.scanPercent = 10;

Then 10% of the records in each partition are returned.

Are all nodes guaranteed to be scanned even for low scanPercent values?

Yes. Every partition is scanned at the specified scanPercent value, so all nodes are scanned.

Does the LastUpdateTime/Expiration play a role in which records get returned first if I use scanPercent?

LastUpdateTime/Expiration does not play a role in which records are returned. Using the 10% example, the scanPercent value of 10 is passed on to the tree reduction, so each partition returns 10% of its records.

If the specified scanPercent is low and the namespace holds only a few records, or the records are unevenly distributed, the number of records returned can be inaccurate.
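To see why a low percentage on a small namespace can be inaccurate, consider the simplified model below. It assumes each partition independently returns its record count multiplied by scanPercent and truncated to a whole number; the actual server rounding may differ, so treat this as an illustration only. The class and method names are hypothetical.

```java
// Simplified model of per-partition percentage sampling (illustration only,
// not server code).
final class PartitionSampleModel {

    // Records returned if every partition applies the percentage on its own,
    // truncating to a whole record count (assumed rounding).
    static long perPartitionTotal(long[] partitionSizes, int percent) {
        long returned = 0L;
        for (long size : partitionSizes) {
            returned += size * percent / 100;
        }
        return returned;
    }

    // Records expected if the percentage were applied to the namespace total.
    static long expectedTotal(long[] partitionSizes, int percent) {
        long total = 0L;
        for (long size : partitionSizes) {
            total += size;
        }
        return total * percent / 100;
    }
}
```

With 4096 partitions holding 5 records each, `expectedTotal` gives 2048 for 10%, while `perPartitionTotal` gives 0 under this model, because 10% of 5 truncates to zero in every partition.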

If I define a lastUpdate predicate for my scan with scanPercent = 10, is the predicate applied to the 10%, or the other way around? In other words, is the predicate evaluated first, with 10% of the matching records then returned to the client?

Aerospike applies the 10% first, then applies the predicate expression filter to that subset. This can result in far fewer records being returned than expected, and is a reason to switch to the maxRecords API.
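The difference between the two orderings can be illustrated with a small in-memory simulation. This is not server or client code: records are modeled as last-update timestamps in a list, the predicate is an ordinary `Predicate<Long>`, and all names are hypothetical. `sampleThenFilter` mimics the scanPercent ordering and `filterThenLimit` mimics the maxRecords ordering.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustration only: models the order of sampling and filtering.
final class ScanOrderModel {

    // scanPercent-style: take percent% of the records first, then filter them.
    static List<Long> sampleThenFilter(List<Long> records, int percent,
                                       Predicate<Long> filter) {
        long sampleSize = (long) records.size() * percent / 100;
        return records.stream()
                .limit(sampleSize)
                .filter(filter)
                .collect(Collectors.toList());
    }

    // maxRecords-style: filter first, then stop after maxRecords matches.
    static List<Long> filterThenLimit(List<Long> records, long maxRecords,
                                      Predicate<Long> filter) {
        return records.stream()
                .filter(filter)
                .limit(maxRecords)
                .collect(Collectors.toList());
    }
}
```

For 100 records with timestamps 0 to 99 and a filter that matches even timestamps, `sampleThenFilter(records, 10, filter)` returns 5 records, while `filterThenLimit(records, 10, filter)` returns 10.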

References

Java client maxRecords and scanPercent

Keywords

PERCENTAGE SCANS RANDOM MAX RECORDS SCANPERCENT MAXRECORDS

Timestamp

July 2020
