FAQ - Scans in Aerospike

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Note

Version 4.7 of Aerospike introduced significant changes to the Aerospike scan sub-system. This FAQ covers the Aerospike scan sub-system prior to version 4.7; for versions since 4.7, please refer to FAQ - Scans in versions 4.7 and above.

Synopsis

Scans are one of the commonly used operation in Aerospike. This knowledge-base covers some frequently asked questions about Scans in Aerospike to effectively implement them in your use-case.

- What are the different kinds of scans in Aerospike?

There are 4 types of scans that once can see in Aerospike:

1) Basic Scans:

  • This simply scans all records in a namespace or set.
  • The following command is an example of a basic scan using AQL:
    • aql> select * from <ns>.<set>
  • Basic scans are started by a client, and they have an open socket back to the client. Basic scans can return output to the client. If the client times out or is unable to accept data returned by the server for a scan in progress, it can block the network socket in question and eventually risks temporarily halting the scan subsystem.

2) Aggregation Scans:

  • Aggregation scans are like basic scans but they are expected to return some post-operation result as the output via a Stream UDF.
  • Aggregation scans are started by a client, and they have an open socket back to the client.
  • Aggregation scans can return output to the client. Similar disclaimer for potential stuck scans created holds true for the aggregation scan.

3) UDF Background Scans:

  • UDF Background scans go through all records in a set or namespace and apply the UDF to the records.
  • UDF Background scans do not have a client socket open. They are disconnected, and do not return output to a client.
  • You must monitor the scans to check progress.

4) Sindex builder Scans:

  • This takes place only when a secondary index is created as the records will need to be read in order to populate the secondary index that gets stored in memory.
  • This is initiated as a part of a client creating a secondary index.

- How does the scan subsystem work?

Versions 3.6.0 and above

On the server side, the scan subsystem has a dedicated set of threads that are shared by all scans on a node. The scan thread pool on the server side is governed by a dynamic configuration, scan-threads. The default value for scan-threads is 4, and the maximum value is 32. scan-threads is the only throttle for basic scans. If this value is set too high, scans related activity could interfere with other read or write transactions. If set too low, scans might languish.

The scans are interleaved and are queued up. One thread takes a Job with a partition ID off of the scan queue and scans that partition. Multiple scans can run at once with no contention selecting a job each in round-robin fashion. Each scan scans one partition at a time with no contention.

Scan on a node only processes the master partitions. If a partition is not master, the scan skips that partition.

Note: UDF background scans generate internal UDF transactions. The UDF transactions go to the transaction system. UDF background scans are governed by the scan-max-udf-transactions configuration. The default value for scan-max-udf-transactions is 32. It can dynamically be changed. When the number of UDF transactions meets this value, the transactions sleep. When the number of UDF transactions drops below the scan-max-udf-transactions value, transactions resume. Please refer to the scan udf throttling guide for additional information.

For example, letā€™s consider scan-max-udf-transactions set at 32. For a UDF scan, the scan thread will queue 32 transactions to the transaction queue. When attempting to queue a 33rd transaction, if none of the first 32 transactions has completed, the scan thread will sleep prior to re-attempting that transaction.

Versions prior to 3.6.0

In the old scans, scans waited for their turn. The second scan could not start before the first scan completed. The third scan waited on the second scan, and so on. This was a huge limitation as a slow bigger scan could easily exhaust the resources and halt the subsequent scans.

- How many parallel scans can be run against a cluster?

Scans are triggered via a client (for example, Java, Python or even AQL). There is no set limit for the number of scans that can be requested from the client side. However, from the server side this is controlled by the configuration scan-max-active which is set to 100 by default at a per-node basis. If the number of running scans equals the value for scan-max-active, and you submit one more scan, the database returns the following error: ā€œAS_PROTO_RESULT_FAIL_FORBIDDENā€.

- I have a 20-node cluster and I issued a scan from the client side. But, why do I see active scans on 16 of the nodes in the cluster?

By default, the number of parallel scan requests a client can issue is 16 (configured by thread pool size on the client side and with the ā€˜-zā€™ option on the AQL side). To run a scan on more than 16 nodes at a time, you would need to update the client policy (E.g. API in Java.)

The scan on the C Client library is using a fixed sized sync thread pool (default size = 16) to issue node scans in parallel. When the scan callback returns false, all 16 current node scans are stopped, but some node scans may not have started yet. Unfortunately, C client versions prior to 4.3.11 start these node scans and then check for abort after parsing the first record. This is inefficient. In C client versions 4.3.11 and later a check for abort is done before starting any new node scans.

- Is a scan taskID configurable?

The scan taskId is randomly generated internally and not currently configurable.

- How do I set priority for a scan if I have multiple scans?

Versions 3.6.0 and above

To run scans with different priority, you can change the set-priority for the relevant scan job to have the priority jobs managed on a different queue but still undertaken by the same scan-threads.

asinfo -v 'jobs:module=scan;cmd=set-priority;trid=<jobid>;value=3'

Each scan job lives in one of three priority queues, High, Medium, and Low:

|     High      |    Medium     |      Low      |
|  - - - - - -  |  - - - - - -  | - - - - - - - |
| Partition IDX | Partition IDY | Partition IDX |
| Partition IDY | Partition IDZ | Partition IDY |

If the high priority queue is empty, scans work from the medium priority queue. If the medium queue is empty, scans work from the low priority queue. Scan priority only matters if more than one scan job is running at a time.

When issued simultaneously and with the same priority, scans on smaller namespaces will not necessarily finish faster than scans on larger namespaces. This is because scans will pick up partitions from the different pending jobs in a round robin fashion, the progress will be interleaved.

Versions prior to 3.6.0

Refer to the managing scan page for details on setting priority.

- I issued 150 scans which completed, but why do I see only 100 when I check statistics on the Aerospike server?

scan-max-done controls the number of completed scans held for monitoring. The default value is 100. It stores scan details for later review. Completed scans contain information on when the scan started, when it completed, how long ago it completed, what it found, how many bytes it sent over the network.

- My set of 10 records takes over 5 minutes on each node. How can this simple operation be so slow?

Note: In versions 5.6 and above, the enable-index configuration parameter can be leveraged to create a special index for a set to be scanned efficiently.

Scanning a set will have to scan the whole namespace, therefore, whether there is 1 record in a set or whether the set covers the whole namespace, the scan will regardless have to go through every record in the namespace (the index part) and then return the records matching the set.

This is because the set information is stored along with each records, both in their index and along with the record itself (memory or disk based on the configuration). When a scan is issued against a set in a namespace, the whole namespace has to be reduced, partition by partition in order to check, for each entry, whether it belongs to the set being scanned. Reducing a partition means traversing it entirely.

Note: Reducing a partition requires acquiring a lock which may slow down new record creations and record deletions within the same partition. In versions 3.11 and above, the introduction of extra locks per partitions under partition-tree-locks drastically reduces such contentions.

Depending on your use case, you may want to model things differently based on this. You could find some related information in this article:

Additionally, along with partition reduction phase which will be same for each namespace scan, we have the additional step of fetching the records and returning them to the client. This step could take longer depending on how large a set is, how the storage subsystem is keeping up and how fast the client is consuming what is being returned. Thus, not all scans would take the same amount of time.

- Would the partition reduction phase for a namespace with 100 sets that were 1GB each be the same as a namespace with 100 sets that were 100GB each? Would a namespace that has 10 sets of 100GB each actually reduce faster than 100 sets that are 10GB each?

  • If the namespace has the same number of total records and the set is the same size, the scan will take as long in both cases.
  • If you are issuing multiple scans in parallel, the size of each set can impact how fast you would go overall given the limited number of scan-threads.

- When a scan is issued on a larger set size (for example 100,000,000 records), it sees much less CPU usage compared to a scan issued on a smaller set size (for example 1,000,000 records). This is observed on a namespace where the data is stored on disk and index is stored in memory.

During a scan, scan threads reduce (traverse) the index partition by partition to find the records matching the set. When records matching the set are found, the scan thread fetches the record from disk (unless it is found in the read-page-cache - if enabled - or the post-write-queue) and saves it in a buffer before proceeding to reduce the index further. This process completes when all the master partitions owned by each node are scanned for that namespace.

In the case with a larger set, it is expected to find more matching records while reducing the index, therefore pausing more frequently to fetch the data from disk. This explains why the CPU (or the number of cores matching the scan threads, to be more precise) will not be continuously utilised, considering the I/O operations frequently involved.

In the case with a smaller set, as there would be much less records found, the scan threads will be mostly reducing the index, which is CPU intensive. This will result in a number of CPU cores (matching the number of scan threads) to be fully utilized. The time taken to scan the smaller set would of course be much faster than the time taken for scanning the larger set.

Note: At the time where this article is updated, Aerospike is working on a new feature to allow scanning of sets to not have to reduce the whole index for the namespace they belong to.

- Why do I get a timeout when I scan a set of 10 records when my set of 100 million records works fine and does not timeout?

aql> select * from namespace.set
Error: (9) Timeout: timeout=1000 iterations=1 failedNodes=0 failedConns=0
  • AQL timed out because it did not get any data for > 1 sec. Now, with smaller sets this might be more probable than a larger set as not all partitions reduced contained records from the small set. Increasing the timeout would help here.

- How do I decide on a scan timeout?

A adequate timeout for a scan would depend on the factors that impact a scan performance - Disk I/O, number of parallel ongoing scans, network between client and server and of-course the use case (how long is a valid timeout for the specific use case). So, yes you would need to tune the timeout if the scale of the namespace increases. There are other configuration tunings that are available on the server side as well (scan-threads, scan-max-active) which may help before having to increase the timeout, but you might not want to give scans all the resources for the cluster in order to prevent interference with the other read/write transactions.

- How do I decide on a retry setting for a scan?

Always set retries to 0 in your client policy for any scan job. A retry is not likely to work and could result in duplicate records being returned. See Why do I get a warning - ā€œjob with trid X already activeā€ when issuing a scan? for details.

Notes

Keywords

small scan slow partition timeout retry retries

Timestamp

May 2021

1 Like