Select random records from a set

I want to select a sample of random ‘n’ records from a set in the namespace. Is there a way to achieve this in Aerospike Query Language?

In Oracle, we achieve something similar with the following query:

SELECT * FROM <table-name> sample block(10) where rownum < 101 

The above query fetches blocks of size of 10 rows from a sample size of 100.

I’m using the Aerospike Java Client and performing batch reads for the records. While I can perform a random sample selection of the records list in the code, I want to know if this sampling can be addressed directly from the aerospike query end.

The final goal is to get ‘n’ random records and use them for processing.

Not afaik, AQL is very basic. You can do that in client libraries using the ‘max records’ scan policy. I’m not sure why they haven’t added that yet, seems fairly straightforward and i think a lot of people would like that (myself included!) https://docs.aerospike.com/apidocs/java/com/aerospike/client/policy/ScanPolicy.html

Seems there is also an answer to your question on Stack Overflow.

BTW, loosely related, there is a new feature enable-index which makes scanning a set very efficient. I think you can then add some expressions on top of that to do any filtering and add the maxRecords in the policy to limit the number of records.

Since you cross-posted on Stackoverflow, I’ll plagiarize my answer from that post.

Rows are like records in Aerospike, and columns are like bins. You don’t have a way to sample random columns from a table, do you?

You can sample random records from a set using ScanPolicy.maxRecords added to a scan of that set. Note the new (optional) set indexes in Aerospike version 5.6 may accelerate that operation.

Each namespace has its data partitioned into 4096 logical partitions, and the records in the namespace evenly distributed to each of those using the characteristics of the 20-byte RIPEMD-160 digest. Therefore, Aerospike doesn’t have a rownum , but you can leverage the data distribution to sample data.

Each partition is roughly 0.0244% of the namespace. That’s a sample space you can use, similar to the SQL query above. Next, if you are using the ScanParition method of the client, you can give it the ScanPolicy.maxRecords to pick a specific number of records out of that partition. Further you can start after an arbitrary digest (see PartitionFilter.after ) if you’d like.

Ok, now let’s talk data browsing. Instead of using the aql tool, you could be using the Aerospike JDBC driver, which works with any JDBC compatible data browser like DBeaver, SQuirreL, and Tableau. When you use LIMIT on a SELECT statement it will basically do what I described above - use partition scanning and a max-records sample on that scan. I suggest you try this as an alternative.

© 2021 Copyright Aerospike, Inc. | All rights reserved. Creators of the Aerospike Database.