For sampling purposes, users often need to scan only a subset of the records in an Aerospike cluster.
Using maxRecords (applicable for server version 4.9 and above)
On server versions 4.9 and above, the maxRecords API is recommended over scanPercent. Compared with scanPercent, maxRecords has two main advantages: it is entirely random, and it is more accurate in the number of records it returns.
How do I replace scanPercent in my code with the new maxRecords API when I upgrade the server from a version prior to 4.9?
To get a random percentage of the records using the maxRecords API, issue an info call from the client to obtain the total number of records in the cluster, then set maxRecords to the desired fraction of that total.
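The calculation described above can be sketched as follows. This is a minimal illustration, not client code: MaxRecordsCalc and its methods are hypothetical names, and it assumes each node's namespace info response is a semicolon-separated list of name=value pairs that includes a master_objects statistic. Summing master_objects avoids counting replica copies more than once.

```java
import java.util.List;

// Hypothetical helper (not part of the Aerospike client) that converts a
// percentage into a maxRecords value from per-node info responses.
final class MaxRecordsCalc {

    // Extracts one numeric statistic from a "k1=v1;k2=v2;..." response.
    // The response format is an assumption about the namespace info output.
    static long stat(String infoResponse, String name) {
        for (String pair : infoResponse.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).trim().equals(name)) {
                return Long.parseLong(pair.substring(eq + 1).trim());
            }
        }
        throw new IllegalArgumentException("statistic not found: " + name);
    }

    // Sums master_objects over all nodes and returns the requested
    // percentage of that total, rounded down.
    static long fromPercent(List<String> perNodeResponses, int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be 0..100");
        }
        long total = 0;
        for (String response : perNodeResponses) {
            total += stat(response, "master_objects");
        }
        return total * percent / 100;
    }
}
```

In an application, the response strings would typically come from an info request for the namespace on each node, and the result would be assigned to the scan policy's maxRecords field. Those client calls are not shown here because their exact form depends on the client version in use.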
If I define a lastUpdate predicate for a scan with maxRecords = 5000, is the predicate applied to only 5000 records, or the other way around? In other words, is the predicate evaluated on 5000 records, or does the final output contain maxRecords records, all of which evaluate to true?
Aerospike will apply the predicate expression filter on the records it processes and the final output will have the maxRecords number of records. Note that this is a change in behavior from using scanPercent (described below).
Using scanPercent (applicable for server versions prior to 4.9)
What is the order of records when scanning at scanPercent < 100? Will it be random?
The scan subsystem scans a number of records in each partition based on the scanPercent value. For example, if 10 is specified in Java:
policy.scanPercent = 10;
then 10% of the records in each partition are returned.
Are all nodes guaranteed to be scanned even for low scanPercent values?
Yes. Every partition is scanned at the specified scanPercent value, so all nodes are scanned.
Does the LastUpdateTime/Expiration play a role in which records get returned first if I use scanPercent?
The LastUpdateTime/Expiration does not play a role in which records are returned. Using the 10% example, the scanPercent value of 10 is passed on to the tree reduction, so each partition returns 10% of its records.
If the specified scanPercent is low and the namespace holds only a few records, or the records are unevenly distributed, the number of records returned can be inaccurate.
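To see why a low percentage over a small data set can be inaccurate, consider the simplified model below. It assumes each partition independently returns its own percentage, rounded down to a whole record. That rounding rule is an assumption made for illustration, not confirmed server behavior, and PartitionPercentModel is a hypothetical name.

```java
import java.util.Arrays;

// Hypothetical model of a percentage applied partition by partition.
// ASSUMPTION: each partition truncates its share to a whole number.
final class PartitionPercentModel {

    // Builds an array of per-partition record counts, all equal.
    static long[] uniform(int partitions, long recordsPerPartition) {
        long[] counts = new long[partitions];
        Arrays.fill(counts, recordsPerPartition);
        return counts;
    }

    // Records returned when the percentage is applied to each partition.
    static long returnedPerPartition(long[] counts, int percent) {
        long total = 0;
        for (long count : counts) {
            total += count * percent / 100; // integer division truncates
        }
        return total;
    }

    // Records a user would expect: the percentage of the grand total.
    static long expectedOverall(long[] counts, int percent) {
        long sum = 0;
        for (long count : counts) {
            sum += count;
        }
        return sum * percent / 100;
    }
}
```

Under this model, 4096 partitions holding 5 records each would return no records at 10%, although 10% of the 20480 records is 2048. With 1000 records per partition, the two figures agree.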
If I define a lastUpdate predicate for a scan with scanPercent = 10, is the predicate applied to the 10%, or the other way around? In other words, is the predicate evaluated first, with 10% of the matching records then returned to the client?
Aerospike will apply the 10% first, then apply the predicate expression filter on that. This can result in a much lower number of records returned than expected and would be a reason to switch over to the maxRecords API.
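The difference in ordering between the two APIs can be illustrated with a small in-memory simulation. This is only a sketch: FilterOrderDemo is a hypothetical class, records are plain integers, and the "sample" is simply the first N records rather than a random selection, which keeps the result easy to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical simulation of the two orderings; not Aerospike client code.
final class FilterOrderDemo {

    // scanPercent-style: take the percentage first, then apply the filter
    // to that sample only.
    static List<Integer> sampleThenFilter(List<Integer> records, int percent,
                                          IntPredicate filter) {
        int sampleSize = records.size() * percent / 100;
        List<Integer> out = new ArrayList<>();
        for (int record : records.subList(0, sampleSize)) {
            if (filter.test(record)) {
                out.add(record);
            }
        }
        return out;
    }

    // maxRecords-style: apply the filter while scanning and stop once
    // maxRecords matching records have been collected.
    static List<Integer> filterThenCap(List<Integer> records, long maxRecords,
                                       IntPredicate filter) {
        List<Integer> out = new ArrayList<>();
        for (int record : records) {
            if (out.size() >= maxRecords) {
                break;
            }
            if (filter.test(record)) {
                out.add(record);
            }
        }
        return out;
    }
}
```

With 100 records numbered 0 to 99 and a filter that matches every fourth record, sampling 10% first leaves only 3 matching records, whereas filtering with a cap of 10 returns 10 matching records.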