Delete records matching regex

I'm trying to create a client application which can find records in a set matching a regex. Since there are 100 million records in the database (matching the regex), it may not be feasible to do a scan or even use an expression filter.

Please suggest recommendations on how to do it without bringing down the cluster.

Thanks in advance, Sri

Hi Sri,

There are a couple of options depending on the use case. First, as you mention, you can scan the records and apply an expression filter. This will load all the records, apply the expression filter and return just the ones which match. Whilst this will have to inspect all 100m records, it will be done in parallel across the cluster and typically with multiple threads per node so performance should be reasonable.
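For reference, option 1 might look something like this. This is only a minimal sketch – the "test" namespace, "customers" set and the name bin are placeholders for whatever your schema uses, and it needs a running cluster to execute:

```java
// Sketch of option 1: foreground query with a server-side expression filter.
// Namespace, set and bin names here are assumptions, not your real schema.
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.exp.Exp;
import com.aerospike.client.policy.QueryPolicy;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.RegexFlag;
import com.aerospike.client.query.Statement;

public class FilteredScan {
    public static void main(String[] args) {
        try (AerospikeClient client = new AerospikeClient("localhost", 3000)) {
            Statement stmt = new Statement();
            stmt.setNamespace("test");
            stmt.setSetName("customers");

            QueryPolicy qp = new QueryPolicy(client.getQueryPolicyDefault());
            // The expression is evaluated on the server nodes; only records
            // which match are returned to the client.
            qp.filterExp = Exp.build(
                Exp.regexCompare("^user:", RegexFlag.NONE, Exp.stringBin("name")));

            try (RecordSet rs = client.query(qp, stmt)) {
                while (rs.next()) {
                    System.out.println(rs.getKey() + " -> " + rs.getRecord());
                }
            }
        }
    }
}
```

The scan is distributed across the cluster, so the client mostly just consumes the stream of matches.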

Is there any way to narrow the search criteria down at all? For example with a date range or some other consideration which would allow a secondary index + expression filter to be used? Even if it was something like “80% of the time the users want to match a prefix with a wildcard at the end” we could optimize for the common case.
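To make that concrete, here's a rough sketch of what the combination could look like. It assumes a numeric lastSeen bin with a secondary index already built on it – purely hypothetical, since I don't know your schema – and slots into the same client setup as a normal query:

```java
// Hypothetical: assumes a secondary index exists on a numeric "lastSeen" bin.
import com.aerospike.client.exp.Exp;
import com.aerospike.client.policy.QueryPolicy;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.RegexFlag;
import com.aerospike.client.query.Statement;

Statement stmt = new Statement();
stmt.setNamespace("test");
stmt.setSetName("customers");
// The secondary index narrows the candidate records server-side...
stmt.setFilter(Filter.range("lastSeen", 1700000000L, 1710000000L));

QueryPolicy qp = new QueryPolicy(client.getQueryPolicyDefault());
// ...and the expression refines the candidates to the common prefix case.
qp.filterExp = Exp.build(
    Exp.regexCompare("^user:", RegexFlag.NONE, Exp.stringBin("name")));
```

The win is that only records inside the index range are touched at all, rather than every record in the set.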

What sort of SLA do you have for the users and how often will this delete job be run?

Thanks,

Tim

Hi Tim,

Thanks for the response. The use case we are trying to address is as follows: we want to shrink an existing set by deleting records matching a regex of the form "user:*" (prefix matches). This is more of a one-time task and is not user facing.

The main concern is that it should not have any performance impact on the cluster (the application is a real-time bidder, so it shouldn't encounter any latency issues) and shouldn't cause any crashes, out-of-memory errors, etc.

Also, the client application shouldn't consume a lot of memory (especially if all the matching records have to be shipped over the network).

Thanks, Sri

Tim,

What if we create a new set with the required values (instead of truncating the original set), and switch over to the new set once the data population is complete? What would be the memory requirements on the server in this case?

Thanks, Sri

Hi Sri,

Ah, ok. So there’s no need to optimize for performance, it’s actually the opposite – you deliberately want it to run slower to minimise impact on the rest of the cluster and the RTB process. Depending on your version of Aerospike, you can do this with a background PI query (scan) with an appropriate filter and a rate limit set in the statement.

For example, say I have a set customers and I want to delete everyone whose name bin starts with A through M. I can do this by:

import com.aerospike.client.Operation;
import com.aerospike.client.exp.Exp;
import com.aerospike.client.policy.WritePolicy;
import com.aerospike.client.query.RegexFlag;
import com.aerospike.client.query.Statement;
import com.aerospike.client.task.ExecuteTask;

Statement stmt = new Statement();
stmt.setSetName(SET_NAME);
stmt.setNamespace(NAMESPACE);
stmt.setRecordsPerSecond(50); // throttle the background scan
WritePolicy wp = new WritePolicy(client.getWritePolicyDefault());
wp.filterExp = Exp.build(Exp.regexCompare("^[A-M]", RegexFlag.ICASE, Exp.stringBin("name")));
ExecuteTask task = client.execute(wp, stmt, Operation.delete());
task.waitTillComplete();

This will launch a background scan which looks at about 50 records per second, runs the filter expression against each record and deletes any which match the passed expression. Note that this all takes place on the server – none of these records are sent back to the client. And as it's processed record by record, there are no concerns about memory or any other critical resources.
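One thing to watch with the regex itself: "user:*" is glob syntax, not a regex. As a regex it means "user" followed by zero or more colons, and an unanchored search would also match keys that merely contain "user"; anchoring with ^ (as in the [A-M] example) gives a true prefix match. A quick plain-Java illustration of the difference:

```java
import java.util.regex.Pattern;

public class PrefixRegexDemo {
    public static void main(String[] args) {
        // "user:*" as a regex is "user" plus zero or more ':' characters,
        // so an unanchored search also finds it inside other keys.
        System.out.println(Pattern.compile("user:*").matcher("poweruser99").find());  // true

        // "^user:" anchors at the start: a genuine prefix match.
        System.out.println(Pattern.compile("^user:").matcher("user:1234").find());    // true
        System.out.println(Pattern.compile("^user:").matcher("poweruser99").find());  // false
    }
}
```

So for your case the expression would use something like "^user:" rather than "user:*".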

I decided to wait until the scan was complete so I could check the results, but the task.waitTillComplete() is optional depending on your use case.

This is the best way to do it in my opinion. We can explore the multiple set idea if you don’t like this technique, but this way is clean and simple. It will take some time to run, but that’s by design.

Thanks,

Tim