Select random records from a set

shreya · May 12, 2021, 4:41am

I want to select a sample of random ‘n’ records from a set in the namespace. Is there a way to achieve this in Aerospike Query Language?

In Oracle, we achieve something similar with the following query:

SELECT * FROM <table-name> sample block(10) where rownum < 101

The above query samples roughly 10% of the table's blocks and returns at most 100 rows from that sample.

I’m using the Aerospike Java Client and performing batch reads for the records. While I can randomly sample the list of records in my own code, I want to know if this sampling can be done directly on the Aerospike query side.
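For reference, the client-side fallback mentioned here can be done in a single pass without holding every record in memory. The following is a minimal sketch of reservoir sampling (Algorithm R) in plain Java; the class and method names are hypothetical and nothing in it is Aerospike-specific.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper: keeps a uniform random sample of n items from a stream.
class ReservoirSample {

    static <T> List<T> sample(Iterator<T> items, int n, Random rng) {
        List<T> reservoir = new ArrayList<>(n);
        long seen = 0;
        while (items.hasNext()) {
            T item = items.next();
            seen++;
            if (reservoir.size() < n) {
                // Fill the reservoir with the first n items.
                reservoir.add(item);
            } else {
                // Replace a random slot with probability n / seen.
                long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
                if (j < n) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Stand-in for keys or records coming back from the client.
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add(i);
        }
        System.out.println(sample(keys.iterator(), 10, new Random()));
    }
}
```

If the stream holds fewer than n items, the sketch simply returns all of them.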

The final goal is to get ‘n’ random records and use them for processing.

Albot · May 15, 2021, 12:45am

Not afaik, AQL is very basic. You can do that in client libraries using the ‘max records’ scan policy. I’m not sure why they haven’t added that to AQL yet; it seems fairly straightforward and I think a lot of people would like that (myself included!). https://docs.aerospike.com/apidocs/java/com/aerospike/client/policy/ScanPolicy.html

meher · May 17, 2021, 11:53pm

Seems there is also an answer to your question on Stack Overflow.

BTW, loosely related, there is a new feature enable-index which makes scanning a set very efficient. I think you can then add some expressions on top of that to do any filtering and add the maxRecords in the policy to limit the number of records.

rbotzer · May 22, 2021, 4:20am

Since you cross-posted on Stack Overflow, I’ll plagiarize my answer from that post.

Rows are like records in Aerospike, and columns are like bins. You don’t have a way to sample random columns from a table, do you?

You can sample random records from a set using ScanPolicy.maxRecords added to a scan of that set. Note the new (optional) set indexes in Aerospike version 5.6 may accelerate that operation.

Each namespace has its data partitioned into 4096 logical partitions, and the records in the namespace are evenly distributed across them based on each record’s 20-byte RIPEMD-160 digest. Aerospike doesn’t have a rownum, but you can leverage this data distribution to sample data.
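As a rough illustration of treating partitions as a sample space, the sketch below picks distinct random partition IDs and computes what share of the namespace they should cover, assuming records really are spread evenly. The class and method names are hypothetical; only the 4096 partition count comes from the post.

```java
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for choosing partitions to sample from.
class PartitionSample {
    static final int PARTITIONS = 4096; // logical partitions per namespace

    // Expected fraction of the namespace held by the given number of partitions.
    static double fractionOfNamespace(int partitionCount) {
        return (double) partitionCount / PARTITIONS;
    }

    // Picks `count` distinct partition IDs in the range [0, 4096).
    static Set<Integer> randomPartitionIds(int count, Random rng) {
        if (count < 0 || count > PARTITIONS) {
            throw new IllegalArgumentException("count must be between 0 and " + PARTITIONS);
        }
        Set<Integer> ids = new TreeSet<>();
        while (ids.size() < count) {
            ids.add(rng.nextInt(PARTITIONS));
        }
        return ids;
    }

    public static void main(String[] args) {
        Set<Integer> ids = randomPartitionIds(10, new Random());
        System.out.printf("partitions %s cover about %.4f%% of the namespace%n",
                ids, fractionOfNamespace(ids.size()) * 100);
    }
}
```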

Each partition is roughly 0.0244% (1/4096) of the namespace. That’s a sample space you can use, similar to the SQL query above. Next, if you are using the scanPartitions method of the client, you can set ScanPolicy.maxRecords to pick a specific number of records out of that partition. Further, you can start after an arbitrary digest (see PartitionFilter.after) if you’d like.
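A minimal sketch of that approach with the Java client might look like the following. It assumes a client version that provides scanPartitions and PartitionFilter (roughly the 5.x line current when this thread was written); the host, port, namespace, set name, and class name are placeholders, not values from this thread.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.policy.ScanPolicy;
import com.aerospike.client.query.PartitionFilter;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class SampleFromPartition {
    public static void main(String[] args) {
        // Placeholders (assumptions): adjust to your cluster and data model.
        String namespace = "test";
        String setName = "demo";
        int n = 100;

        // Use one randomly chosen partition (of 4096) as the sample space.
        int partitionId = ThreadLocalRandom.current().nextInt(4096);

        ScanPolicy policy = new ScanPolicy();
        policy.maxRecords = n; // ask the scan to stop after about n records

        // The callback may run on client worker threads, so use a synchronized list.
        List<Key> sampled = Collections.synchronizedList(new ArrayList<>());

        try (AerospikeClient client = new AerospikeClient("127.0.0.1", 3000)) {
            client.scanPartitions(policy, PartitionFilter.id(partitionId),
                    namespace, setName,
                    (key, record) -> sampled.add(key));
        }

        System.out.println("Partition " + partitionId + " returned " + sampled.size() + " records");
    }
}
```

If the chosen partition holds fewer than n records of the set, the scan returns fewer. In that case one option is to widen the sample space with PartitionFilter.range(begin, count) so the scan covers several partitions.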

Ok, now let’s talk data browsing. Instead of using the aql tool, you could be using the Aerospike JDBC driver, which works with any JDBC compatible data browser like DBeaver, SQuirreL, and Tableau. When you use LIMIT on a SELECT statement it will basically do what I described above - use partition scanning and a max-records sample on that scan. I suggest you try this as an alternative.

system · August 14, 2021, 4:21am

This topic was automatically closed 84 days after the last reply. New replies are no longer allowed.
