Scan jobs appear to stall during set deletion


Aerospike Server version 3.12.1 or above

As of version 3.12.1, released in April 2017, set-delete is deprecated. Truncate, a new feature that deletes all the data in a namespace or set, is now supported in the database.

Scan jobs will not appear hung when using the truncate command.

See the following truncate documentation: http://www.aerospike.com/docs/reference/info#truncate
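As a rough illustration, the sketch below builds the text of a truncate info command for a namespace and an optional set. The `truncate:namespace=...;set=...` format reflects our reading of the info reference linked above and should be checked against it. The class name, the helper name, and the namespace `test` and set `demo` are hypothetical.

```java
public class TruncateCommandSketch {

    /**
     * Builds the info command text for a truncate request.
     * Assumption: the command takes the form "truncate:namespace=<ns>;set=<set>".
     * When set is null or empty, the command targets the whole namespace.
     */
    public static String truncateCommand(String namespace, String set) {
        if (namespace == null || namespace.isEmpty()) {
            throw new IllegalArgumentException("namespace is required");
        }
        StringBuilder command = new StringBuilder("truncate:namespace=").append(namespace);
        if (set != null && !set.isEmpty()) {
            command.append(";set=").append(set);
        }
        return command.toString();
    }

    public static void main(String[] args) {
        // "test" and "demo" are placeholder names, not taken from a real cluster.
        System.out.println("asinfo -v \"" + truncateCommand("test", "demo") + "\"");
    }
}
```

The resulting string would be passed to asinfo with the `-v` option, as the `main` method prints.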

Truncate can also be executed from the client APIs. Here is the Java documentation: http://www.aerospike.com/apidocs/java/com/aerospike/client/AerospikeClient.html

Aerospike Server version 3.12.1 or below

Problem description

When deleting sets in production in Aerospike 3.6.4 to 3.12.1 using the Java-based set delete utility (http://www.aerospike.com/launchpad/deleting_sets_and_data.html), scans running against the cluster may appear to hang.

Explanation

This is because the Java set delete utility also runs a scan, potentially at a higher priority than the read scan issued by the application. From Aerospike 3.6.4 onwards, scans can run at three priorities: low, medium and high. Each priority has an associated queue on which scan jobs sit. A finite number of scan threads (4 by default) pull jobs from the queues and run them against master partitions to determine which records are affected. A thread takes a job from a queue and scans the index of the first master partition it finds to determine the affected records. It then puts the job back on the queue for the next scan thread to take and process through the next master partition. As a result, while scan jobs are sitting in the high priority queue, threads work on those rather than on jobs in the lower priority queues.

If incoming scans are running at a lower priority, they may appear to hang because all configured scan threads are working on high priority jobs, in this case the set deletion.
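The scheduling described above can be sketched as a small, deliberately simplified model. It is not the server's implementation: time is divided into ticks, each thread takes one job per tick from the highest priority non-empty queue, processes one partition, and requeues the job if partitions remain. All class, method and job names are hypothetical, and the example assumes there are at least as many high priority jobs as scan threads.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class ScanPrioritySketch {

    // Declaration order is the order in which queues are serviced.
    public enum Priority { HIGH, MEDIUM, LOW }

    public static final class Job {
        final String name;
        final Priority priority;
        int partitionsLeft;

        public Job(String name, Priority priority, int partitions) {
            this.name = name;
            this.priority = priority;
            this.partitionsLeft = partitions;
        }
    }

    /** Runs the model; returns, for each tick, the names of the jobs that got a thread. */
    public static List<List<String>> run(List<Job> jobs, int threads) {
        List<Deque<Job>> queues = new ArrayList<>();
        for (Priority p : Priority.values()) {
            queues.add(new ArrayDeque<>());
        }
        for (Job job : jobs) {
            queues.get(job.priority.ordinal()).add(job);
        }

        List<List<String>> log = new ArrayList<>();
        while (queues.stream().anyMatch(q -> !q.isEmpty())) {
            // Each thread takes the head of the highest priority non-empty queue.
            List<Job> taken = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                for (Deque<Job> queue : queues) {
                    if (!queue.isEmpty()) {
                        taken.add(queue.poll());
                        break;
                    }
                }
            }
            // Each taken job processes one partition, then goes back on its queue.
            List<String> serviced = new ArrayList<>();
            for (Job job : taken) {
                job.partitionsLeft--;
                serviced.add(job.name);
                if (job.partitionsLeft > 0) {
                    queues.get(job.priority.ordinal()).add(job);
                }
            }
            log.add(serviced);
        }
        return log;
    }

    public static void main(String[] args) {
        List<Job> jobs = new ArrayList<>();
        for (int i = 1; i <= 4; i++) {
            jobs.add(new Job("set-delete-" + i, Priority.HIGH, 3));
        }
        jobs.add(new Job("read-scan", Priority.LOW, 2));

        List<List<String>> log = run(jobs, 4);
        for (int tick = 0; tick < log.size(); tick++) {
            System.out.println("tick " + (tick + 1) + ": " + log.get(tick));
        }
    }
}
```

In this model the low priority `read-scan` job receives no thread until every high priority job has finished its partitions, which is the apparent hang seen by the application.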

Solution

The right solution to this issue will depend on the use case. It may be permissible for other scans to be held up while set-deletes are running, in which case no action need be taken. It may be that other scans should fail when this happens, in which case the client could be modified to include a timeout.
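One possible shape for a client-side timeout is sketched below using only the Java standard library. The scan is represented by a hypothetical `Runnable`; in a real application this would be the blocking scan call. Where the client's own scan policy offers a timeout setting, that is probably the better choice. Note that this approach only stops the application from waiting; it makes no claim about whether the scan job is abandoned on the server.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ScanTimeoutSketch {

    /**
     * Runs the given scan on a worker thread.
     * Returns true if it completed within the timeout, false if it was given up on.
     */
    public static boolean runWithTimeout(Runnable scan, long timeoutMillis)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(scan);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            return false;
        } catch (ExecutionException e) {
            throw new RuntimeException("scan failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical stand-in for a scan that is starved of scan threads.
        Runnable stalledScan = () -> {
            try {
                Thread.sleep(5_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        boolean finished = runWithTimeout(stalledScan, 200);
        System.out.println(finished ? "scan completed" : "scan timed out");
    }
}
```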

Another option would be to change the priority of the read scans. This can be done dynamically from asinfo:

http://www.aerospike.com/docs/operations/manage/scans/

One further suggestion would be to use the asinfo command to delete the set. This does not run a scan job but instead relies on the Namespace Supervisor process to remove the records within the set.

Note

This behaviour is not limited to the Java set delete utility; it would be the same with any scan that is being used to delete records from the cluster.